Model Selection¶
In which we choose the best model to predict the age of a crab.¶
GitHub Repository¶
Notebook Viewer¶
Kaggle Dataset¶
Table of Contents¶
- Define Constants
- Import Libraries
- Load Data from Cache
- Split the Data
- Metrics Used
- Model Exploration
- Naive Linear Regression
- Neural Network Model
- Neural Network Model (32-16-8-1)
- Neural Network Model (16-8-1)
- Neural Network Model (8-1)
- Neural Network Model (4-1)
- Neural Network Model (2-1)
- True vs Predicted Age Scatter Plots
- Training Loss Over Time Plots
- Re-Train the Models Again
- Re-Plot the Training Loss Over Time
- Model Leaderboard
- Choose the Best Architecture for the Job
- Hyperparameter Tuning
- Winner, Winner, Crab's for Dinner!
- Onwards to Feature Engineering
Define Constants¶
%%time
CACHE_FILE = '../cache/splitcrabs.feather'
NEXT_NOTEBOOK = '../2-features/features.ipynb'
MODEL_CHECKPOINT_FILE = '../cache/best_model.weights.h5'
PREDICTION_TARGET = 'Age' # 'Age' is predicted
DATASET_COLUMNS = ['Sex_F','Sex_M','Sex_I','Length','Diameter','Height','Weight','Shucked Weight','Viscera Weight','Shell Weight',PREDICTION_TARGET]
REQUIRED_COLUMNS = [PREDICTION_TARGET]
NUM_EPOCHS = 100
VALIDATION_SPLIT = 0.2
CPU times: total: 0 ns Wall time: 0 ns
Import Libraries¶
%%time
from notebooks.time_for_crab.mlutils import display_df, generate_neural_pyramid
from notebooks.time_for_crab.mlutils import plot_training_loss, plot_training_loss_from_dict, plot_true_vs_pred_from_dict
from notebooks.time_for_crab.mlutils import score_combine, score_comparator, score_model
import keras
keras_backend = keras.backend.backend()
print(f'Keras version: {keras.__version__}')
print(f'Keras backend: {keras_backend}')
if keras_backend == 'tensorflow':
    import tensorflow as tf
    print(f'TensorFlow version: {tf.__version__}')
    print(f'TensorFlow devices: {tf.config.list_physical_devices()}')
elif keras_backend == 'torch':
    import torch
    print(f'Torch version: {torch.__version__}')
    print(f'Torch devices: {torch.cuda.get_device_name(torch.cuda.current_device())}')
    # torch supports windows-native cuda, but CPU was faster for this task
elif keras_backend == 'jax':
    import jax
    print(f'JAX version: {jax.__version__}')
    print(f'JAX devices: {jax.devices()}')
else:
    print('Unknown backend; proceed with caution.')
import numpy as np
import pandas as pd
from typing import Generator
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
pd.set_option('mode.copy_on_write', True)
Keras version: 3.3.3
Keras backend: tensorflow
TensorFlow version: 2.16.1
TensorFlow devices: [PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU')]
CPU times: total: 375 ms Wall time: 2.68 s
Load Data from Cache¶
In the exploratory data analysis section, we saved the cleaned and split data to a cache file. Let's load it back.
%%time
crabs = pd.read_feather(CACHE_FILE)
crabs_test = pd.read_feather(CACHE_FILE.replace('.feather', '_test.feather'))
display_df(crabs, show_distinct=True)
# split features from target
X_train = crabs.drop([PREDICTION_TARGET], axis=1)
y_train = crabs[PREDICTION_TARGET]
X_test = crabs_test.drop([PREDICTION_TARGET], axis=1)
y_test = crabs_test[PREDICTION_TARGET]
print(f'X_train: {X_train.shape}')
print(f'X_test: {X_test.shape}')
DataFrame shape: (3031, 11)
First 5 rows:
Length Diameter Height Weight Shucked Weight Viscera Weight \
3483 1.724609 1.312500 0.500000 50.53125 25.984375 9.429688
993 1.612305 1.312500 0.500000 41.09375 17.031250 7.273438
1427 1.650391 1.262695 0.475098 40.78125 19.203125 8.078125
3829 1.362305 1.150391 0.399902 25.43750 9.664062 4.691406
1468 1.250000 0.924805 0.375000 30.09375 14.007812 6.320312
Shell Weight Sex_F Sex_I Sex_M Age
3483 13.070312 False False True 12
993 14.320312 True False False 13
1427 5.046875 False False True 11
3829 9.781250 False False True 10
1468 8.390625 False False True 9
<class 'pandas.core.frame.DataFrame'>
Index: 3031 entries, 3483 to 658
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Length 3031 non-null float16
1 Diameter 3031 non-null float16
2 Height 3031 non-null float16
3 Weight 3031 non-null float16
4 Shucked Weight 3031 non-null float16
5 Viscera Weight 3031 non-null float16
6 Shell Weight 3031 non-null float16
7 Sex_F 3031 non-null bool
8 Sex_I 3031 non-null bool
9 Sex_M 3031 non-null bool
10 Age 3031 non-null int8
dtypes: bool(3), float16(7), int8(1)
memory usage: 77.0 KB
Info:
None
Length distinct values:
[1.725 1.612 1.65 1.362 1.25 1.6875 1.487 1.5625 1.4375 1.45 ]
Diameter distinct values:
[1.3125 1.263 1.15 0.925 1.2 1.162 0.8877 0.8374 1.388 1.0625]
Height distinct values:
[0.5 0.475 0.4 0.375 0.4624 0.425 0.4126 0.4375 0.2876 0.2625]
Weight distinct values:
[50.53 41.1 40.78 25.44 30.1 45. 32.03 32.38 30.19 29.34]
Shucked Weight distinct values:
[25.98 17.03 19.2 9.664 14.01 19.66 16.16 16.42 14.13 11.37 ]
Viscera Weight distinct values:
[9.43 7.273 8.08 4.69 6.32 9.52 7.242 6.082 5.29 2.623]
Shell Weight distinct values:
[13.07 14.32 5.047 9.78 8.39 11.195 7.51 8.22 7.98 10.914]
Sex_F distinct values:
[False True]
Sex_I distinct values:
[False True]
Sex_M distinct values:
[ True False]
Age distinct values:
[12 13 11 10 9 8 17 6 19 7]
X_train: (3031, 10)
X_test: (759, 10)
CPU times: total: 0 ns
Wall time: 16 ms
Metrics Used¶
Throughout this notebook, we will use the following metrics to evaluate the regression model:
Mean Squared Error¶
- The best score is 0.0
- Lower is better.
- Larger errors are penalized more than smaller errors.
Mean Absolute Error¶
- The best score is 0.0
- Lower is better.
- Less sensitive to outliers.
Explained Variance Score¶
- The best score is 1.0
- Lower is worse.
R2 Score¶
- The best score is 1.0
- Lower is worse.
From the scikit-learn documentation:
Note: The Explained Variance score is similar to the R^2 score, with the notable difference that it does not account for systematic offsets in the prediction. Most often the R^2 score should be preferred.
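All four metrics can be computed by hand with NumPy. The sketch below is only an illustration (not the `score_model` helper from `mlutils`); it also shows the difference the scikit-learn note describes: a prediction with a constant offset still has a perfect Explained Variance, while R^2 penalizes the bias.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: the average squared residual."""
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    """Mean absolute error: the average absolute residual."""
    return np.mean(np.abs(y_true - y_pred))

def explained_variance(y_true, y_pred):
    """1 - Var(residuals) / Var(y_true); blind to a constant offset."""
    return 1.0 - np.var(y_true - y_pred) / np.var(y_true)

def r2(y_true, y_pred):
    """1 - SS_res / SS_tot; a constant offset lowers the score."""
    return 1.0 - np.sum((y_true - y_pred) ** 2) / np.sum((y_true - np.mean(y_true)) ** 2)

# Hypothetical ages: predictions that are exactly one year too high.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = y_true + 1.0
print(mse(y_true, y_pred))                 # 1.0
print(mae(y_true, y_pred))                 # 1.0
print(explained_variance(y_true, y_pred))  # 1.0 -- the offset is invisible
print(r2(y_true, y_pred))                  # 0.8 -- the offset is penalized
```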
Model Exploration¶
So far, we have not done any feature engineering, which is often the most important part of the process. New features constructed from this dataset might later call for a different model. Nonetheless, we can start by using all of the existing features to set a baseline.
We will use the following models:
- Naive Random Baseline
- Linear Regression
- Neural Networks
- (64-32-16-8-1)
- (32-16-8-1)
- (16-8-1)
- (8-1)
- (4-1)
- (2-1)
Naive Linear Regression¶
The simplest model is a linear regression. Before training, its randomly initialized weights amount to random guessing, giving us a naive baseline to beat.
%%time
# layer: input
layer_feature_input = keras.layers.Input(shape=(len(X_train.columns),))
# layer: normalizer
layer_feature_normalizer = keras.layers.Normalization(axis=-1)
layer_feature_normalizer.adapt(np.array(X_train))
# layer: output (linear regression)
layer_feature_output = keras.layers.Dense(units=1)
# architecture:
# input -> normalizer -> linear
# initialize the all_models dictionary
all_models = {'linear': keras.Sequential([
layer_feature_input,
layer_feature_normalizer,
layer_feature_output])}
all_models['linear'].summary()
Model: "sequential"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type)                    ┃ Output Shape           ┃       Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ normalization (Normalization)   │ (None, 10)             │            21 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense (Dense)                   │ (None, 1)              │            11 │
└─────────────────────────────────┴────────────────────────┴───────────────┘
Total params: 32 (132.00 B)
Trainable params: 11 (44.00 B)
Non-trainable params: 21 (88.00 B)
CPU times: total: 15.6 ms Wall time: 50.1 ms
Configure the Linear Model¶
These will be used for all models unless otherwise specified.
- Optimizer
- Adam: Adaptive Moment Estimation (Kingma & Ba, 2014)
- Loss Function
- Mean Squared Error (MSE)
- This penalizes larger errors more than smaller errors.
- We took out outliers in the data cleaning step, so this should perform better.
- Callbacks
- Model Checkpoint
- Save the best model weights.
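To make the loss-function choice concrete, here is a toy comparison with hypothetical residual vectors (not our crab data): two sets of errors with identical MAE, where MSE flags the one containing a single large miss.

```python
import numpy as np

# Hypothetical residuals (years): the same total absolute error,
# but one vector concentrates it in a single 4-year miss.
errors_spread = np.array([1.0, 1.0, 1.0, 1.0])
errors_outlier = np.array([0.0, 0.0, 0.0, 4.0])

print(np.mean(np.abs(errors_spread)), np.mean(np.abs(errors_outlier)))  # 1.0 1.0
print(np.mean(errors_spread ** 2), np.mean(errors_outlier ** 2))        # 1.0 4.0
```

Both vectors have an MAE of 1.0, but squaring the residuals makes the outlier vector four times worse, which is exactly the pressure we want the optimizer to feel.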
Define Common Compile Options¶
Define Common Checkpoint Options¶
%%time
# some framework
def next_adam(learning_rate: float = 0.001) -> Generator[keras.optimizers.Adam, None, None]:
    """Yield the next Adam optimizer with the given learning rate."""
    yield keras.optimizers.Adam(learning_rate=learning_rate)

def common_compile_options(
        optimizer: keras.optimizers.Optimizer = None,
        loss_metric: str = 'mean_squared_error'):
    """Return a dictionary of common compile options.

    :param optimizer: The optimizer to use. Defaults to Adam with LR=0.001.
    :param loss_metric: The loss metric to use. Defaults to 'mean_squared_error'.
    """
    return {
        'optimizer': optimizer if optimizer is not None else next(next_adam()),
        'loss': loss_metric
    }
all_models['linear'].compile(**common_compile_options())
common_checkpoint_options = {
'monitor': 'val_loss',
'save_best_only': True,
'save_weights_only': True,
'mode': 'min'
}
linear_checkpoint = keras.callbacks.ModelCheckpoint(
MODEL_CHECKPOINT_FILE.replace('.weights.h5', '_linear.weights.h5'),
**common_checkpoint_options)
CPU times: total: 0 ns Wall time: 6 ms
Score the Linear Model (Before Training)¶
%%time
untrained_linear_preds = all_models['linear'].predict(X_test).flatten()
# Utility functions imported from mlutils.py
untrained_linear_scores_df = score_model(untrained_linear_preds, np.array(y_test), index='untrained_linear')
# Add it to the leaderboard
leaderboard_df = score_combine(pd.DataFrame(), untrained_linear_scores_df)
leaderboard_df.head()
24/24 ━━━━━━━━━━━━━━━━━━━━ 0s 958us/step CPU times: total: 31.2 ms Wall time: 103 ms
| | mean_squared_error | mean_absolute_error | explained_variance_score | r2_score |
|---|---|---|---|---|
| untrained_linear | 101.943787 | 9.748023 | 0.049686 | -13.000124 |
%%time
common_fit_options = {
'x': X_train,
'y': y_train,
'epochs': NUM_EPOCHS,
'verbose': 0,
'validation_split': VALIDATION_SPLIT
}
linear_history = all_models['linear'].fit(
**common_fit_options,
callbacks=[linear_checkpoint]
)
all_models['linear'].load_weights(MODEL_CHECKPOINT_FILE.replace('.weights.h5', '_linear.weights.h5'))
CPU times: total: 1.98 s Wall time: 7.72 s
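A side note on `validation_split`: Keras carves the validation set from the tail of the training data before any shuffling, so with our 3,031 training rows the split sizes work out to roughly the following (the exact rounding inside Keras may differ by a row):

```python
# Keras holds out the *tail* of the training data for validation.
n_rows = 3031                  # rows in our training set
n_val = int(n_rows * 0.2)      # held out for computing val_loss
n_fit = n_rows - n_val         # used for gradient updates
print(n_fit, n_val)  # 2425 606
```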
Score the Linear Model¶
%%time
linear_preds = all_models['linear'].predict(X_test).flatten()
linear_scores_df = score_model(linear_preds, np.array(y_test), index='linear')
# Add it to the leaderboard
leaderboard_df = score_combine(leaderboard_df, linear_scores_df)
leaderboard_df.head()
24/24 ━━━━━━━━━━━━━━━━━━━━ 0s 438us/step CPU times: total: 31.2 ms Wall time: 50.6 ms
| | mean_squared_error | mean_absolute_error | explained_variance_score | r2_score |
|---|---|---|---|---|
| untrained_linear | 101.943787 | 9.748023 | 0.049686 | -13.000124 |
| linear | 13.669806 | 3.097228 | -0.187996 | -2.867148 |
Neural Network Model¶
Neural Network Architecture¶
We will start with a deep (64-32-16-8-1) neural network, then gradually reduce its complexity in the following sections until performance degrades.
- Input Layer
- All of the features, please.
- Normalizer Layer
- Adapted to all features in the training data.
- Hidden Layers
- Four dense layers with 64, 32, 16, and 8 units (64 >> layer_index) and ReLU activation.
- Output Layer
- Layer with one output.
I know what you're thinking: "Why not start with a simpler model?"
My answer to that: This is for science, and we're going to test them all anyway. It's sometimes easier to copy and delete than it is to build from scratch.
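`generate_neural_pyramid` comes from our `mlutils` module; assuming it simply halves the width at each hidden layer (the `64 >> layer_index` pattern above), the widths it produces can be sketched as below, with each width presumably wrapped in a `keras.layers.Dense(..., activation='relu')`:

```python
def pyramid_widths(num_hidden_layers, num_units):
    """Halve the layer width at each step: 64 -> 32 -> 16 -> 8."""
    return [num_units >> i for i in range(num_hidden_layers)]

print(pyramid_widths(4, 64))  # [64, 32, 16, 8]
print(pyramid_widths(3, 32))  # [32, 16, 8]
```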
%%time
# layer: input - reused from linear model
# layer: normalizer - reused from linear model
# layer(s): hidden (relu) - 64, 32, 16, 8
num_hidden_layers = 4
num_units = 64
layer_deepest_hidden_relu_list = generate_neural_pyramid(num_hidden_layers, num_units)
# layer: output (linear regression)
layer_deepest_output = keras.layers.Dense(units=1)
# architecture:
# input -> normalizer -> hidden(s) -> dense
all_models['64_32_16_8_1'] = keras.Sequential([
layer_feature_input,
layer_feature_normalizer,
*layer_deepest_hidden_relu_list,
layer_deepest_output])
all_models['64_32_16_8_1'].summary()
Model: "sequential_1"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type)                    ┃ Output Shape           ┃       Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ normalization (Normalization)   │ (None, 10)             │            21 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_1 (Dense)                 │ (None, 64)             │           704 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_2 (Dense)                 │ (None, 32)             │         2,080 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_3 (Dense)                 │ (None, 16)             │           528 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_4 (Dense)                 │ (None, 8)              │           136 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_5 (Dense)                 │ (None, 1)              │             9 │
└─────────────────────────────────┴────────────────────────┴───────────────┘
Total params: 3,478 (13.59 KB)
Trainable params: 3,457 (13.50 KB)
Non-trainable params: 21 (88.00 B)
CPU times: total: 15.6 ms Wall time: 26 ms
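As a sanity check on the summary above, the trainable-parameter count of a fully connected stack can be derived by hand: each layer contributes `(fan_in + 1) * fan_out` weights, the `+1` being the bias.

```python
def dense_params(widths):
    """Trainable parameters of a dense stack:
    (fan_in + 1) * fan_out per layer, including biases."""
    return sum((fan_in + 1) * fan_out for fan_in, fan_out in zip(widths, widths[1:]))

# 10 input features -> 64 -> 32 -> 16 -> 8 -> 1 output
print(dense_params([10, 64, 32, 16, 8, 1]))  # 3457, matching the summary
print(dense_params([10, 1]))                 # 11, the linear model above
```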
Configure the Neural Network Model¶
- Optimizer
- Adam: Adaptive Moment Estimation (Kingma & Ba, 2014)
- Loss Function
- Mean Squared Error (MSE)
- This penalizes larger errors more than smaller errors.
- We took out outliers in the data cleaning step, so this should perform better.
- Callbacks
- Model Checkpoint
%%time
all_models['64_32_16_8_1'].compile(**common_compile_options())
deepest_checkpoint = keras.callbacks.ModelCheckpoint(
MODEL_CHECKPOINT_FILE.replace('.weights.h5', '_64_32_16_8_1.weights.h5'),
**common_checkpoint_options)
CPU times: total: 0 ns Wall time: 1 ms
Train the Neural Network Model¶
We're not going to predict with the untrained model, as we already have a random baseline on the leaderboard.
%%time
deepest_history = all_models['64_32_16_8_1'].fit(
**common_fit_options,
callbacks=[deepest_checkpoint]
)
all_models['64_32_16_8_1'].load_weights(MODEL_CHECKPOINT_FILE.replace('.weights.h5', '_64_32_16_8_1.weights.h5'))
CPU times: total: 2.41 s Wall time: 9.09 s
Score the Neural Network Model¶
%%time
deepest_preds = all_models['64_32_16_8_1'].predict(X_test).flatten()
deepest_scores_df = score_model(deepest_preds, np.array(y_test), index='64_32_16_8_1')
# Add it to the leaderboard
leaderboard_df = score_combine(leaderboard_df, deepest_scores_df)
leaderboard_df.head()
24/24 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step CPU times: total: 93.8 ms Wall time: 109 ms
| | mean_squared_error | mean_absolute_error | explained_variance_score | r2_score |
|---|---|---|---|---|
| untrained_linear | 101.943787 | 9.748023 | 0.049686 | -13.000124 |
| linear | 13.669806 | 3.097228 | -0.187996 | -2.867148 |
| 64_32_16_8_1 | 3.746227 | 1.420128 | 0.202596 | 0.202460 |
Neural Network Model (32-16-8-1)¶
Let's cut the first layer out and see if it still has what it takes.
%%time
# layer: input - reused from linear model
# layer: normalizer - reused from linear model
# layer(s): hidden (relu) - 32, 16, 8
num_hidden_layers = 3
num_units = 32
layer_32_16_8_hidden_relu_list = generate_neural_pyramid(num_hidden_layers, num_units)
# layer: output (linear regression)
layer_32_16_8_output = keras.layers.Dense(units=1)
# architecture:
# input -> normalizer -> hidden(s) -> dense
all_models['32_16_8_1'] = keras.Sequential([
layer_feature_input,
layer_feature_normalizer,
*layer_32_16_8_hidden_relu_list,
layer_32_16_8_output
])
all_models['32_16_8_1'].summary()
Model: "sequential_2"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type)                    ┃ Output Shape           ┃       Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ normalization (Normalization)   │ (None, 10)             │            21 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_6 (Dense)                 │ (None, 32)             │           352 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_7 (Dense)                 │ (None, 16)             │           528 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_8 (Dense)                 │ (None, 8)              │           136 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_9 (Dense)                 │ (None, 1)              │             9 │
└─────────────────────────────────┴────────────────────────┴───────────────┘
Total params: 1,046 (4.09 KB)
Trainable params: 1,025 (4.00 KB)
Non-trainable params: 21 (88.00 B)
CPU times: total: 0 ns Wall time: 19 ms
Configure the (32-16-8-1) Neural Network Model¶
%%time
all_models['32_16_8_1'].compile(**common_compile_options())
deep_32_16_8_checkpoint = keras.callbacks.ModelCheckpoint(
MODEL_CHECKPOINT_FILE.replace('.weights.h5', '_32_16_8_1.weights.h5'),
**common_checkpoint_options)
CPU times: total: 0 ns Wall time: 2 ms
Train the (32-16-8-1) Neural Network Model¶
%%time
deep_32_16_8_history = all_models['32_16_8_1'].fit(
**common_fit_options,
callbacks=[deep_32_16_8_checkpoint]
)
all_models['32_16_8_1'].load_weights(MODEL_CHECKPOINT_FILE.replace('.weights.h5', '_32_16_8_1.weights.h5'))
CPU times: total: 2.77 s Wall time: 8.75 s
Score the (32-16-8-1) Neural Network Model¶
%%time
deep_32_16_8_preds = all_models['32_16_8_1'].predict(X_test).flatten()
deep_32_16_8_scores_df = score_model(deep_32_16_8_preds, np.array(y_test), index='32_16_8_1')
# Add it to the leaderboard
leaderboard_df = score_combine(leaderboard_df, deep_32_16_8_scores_df)
leaderboard_df.head()
24/24 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step CPU times: total: 46.9 ms Wall time: 103 ms
| | mean_squared_error | mean_absolute_error | explained_variance_score | r2_score |
|---|---|---|---|---|
| untrained_linear | 101.943787 | 9.748023 | 0.049686 | -13.000124 |
| linear | 13.669806 | 3.097228 | -0.187996 | -2.867148 |
| 64_32_16_8_1 | 3.746227 | 1.420128 | 0.202596 | 0.202460 |
| 32_16_8_1 | 3.892990 | 1.436031 | 0.174743 | 0.174401 |
Neural Network Model (16-8-1)¶
The last one held up, so let's reduce it even more.
%%time
# layer: input - reused from linear model
# layer: normalizer - reused from linear model
# layer(s): hidden (relu) - 16, 8
num_hidden_layers = 2
num_units = 16
layer_16_8_hidden_relu_list = generate_neural_pyramid(num_hidden_layers, num_units)
# layer: output (linear regression)
layer_16_8_output = keras.layers.Dense(units=1)
# architecture:
# input -> normalizer -> hidden(s) -> dense
all_models['16_8_1'] = keras.Sequential([
layer_feature_input,
layer_feature_normalizer,
*layer_16_8_hidden_relu_list,
layer_16_8_output])
all_models['16_8_1'].summary()
Model: "sequential_3"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type)                    ┃ Output Shape           ┃       Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ normalization (Normalization)   │ (None, 10)             │            21 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_10 (Dense)                │ (None, 16)             │           176 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_11 (Dense)                │ (None, 8)              │           136 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_12 (Dense)                │ (None, 1)              │             9 │
└─────────────────────────────────┴────────────────────────┴───────────────┘
Total params: 342 (1.34 KB)
Trainable params: 321 (1.25 KB)
Non-trainable params: 21 (88.00 B)
CPU times: total: 0 ns Wall time: 19 ms
Configure the (16-8-1) Neural Network Model¶
%%time
all_models['16_8_1'].compile(**common_compile_options())
deep_16_8_checkpoint = keras.callbacks.ModelCheckpoint(
MODEL_CHECKPOINT_FILE.replace('.weights.h5', '_16_8.weights.h5'),
**common_checkpoint_options
)
CPU times: total: 0 ns Wall time: 2.51 ms
Train the (16-8-1) Neural Network Model¶
%%time
deep_16_8_history = all_models['16_8_1'].fit(
**common_fit_options,
callbacks=[deep_16_8_checkpoint]
)
all_models['16_8_1'].load_weights(MODEL_CHECKPOINT_FILE.replace('.weights.h5', '_16_8.weights.h5'))
CPU times: total: 3.41 s Wall time: 8.32 s
Score the (16-8-1) Neural Network Model¶
%%time
deep_16_8_preds = all_models['16_8_1'].predict(X_test).flatten()
deep_16_8_scores_df = score_model(deep_16_8_preds, np.array(y_test), index='16_8_1')
# Add it to the leaderboard
leaderboard_df = score_combine(leaderboard_df, deep_16_8_scores_df)
leaderboard_df.head()
24/24 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step CPU times: total: 62.5 ms Wall time: 87.8 ms
| | mean_squared_error | mean_absolute_error | explained_variance_score | r2_score |
|---|---|---|---|---|
| untrained_linear | 101.943787 | 9.748023 | 0.049686 | -13.000124 |
| linear | 13.669806 | 3.097228 | -0.187996 | -2.867148 |
| 64_32_16_8_1 | 3.746227 | 1.420128 | 0.202596 | 0.202460 |
| 32_16_8_1 | 3.892990 | 1.436031 | 0.174743 | 0.174401 |
| 16_8_1 | 3.791295 | 1.422898 | 0.166290 | 0.165454 |
Neural Network Model (8-1)¶
The last reduction didn't lose too much accuracy, so let's continue removing layers.
%%time
# layer: input - reused from linear model
# layer: normalizer - reused from linear model
# layer(s): hidden (relu) - 8
num_hidden_layers = 1
num_units = 8
layer_8_hidden_relu_list = generate_neural_pyramid(num_hidden_layers, num_units)
# layer: output (linear regression)
layer_8_output = keras.layers.Dense(units=1)
# architecture:
# input -> normalizer -> hidden(s) -> dense
all_models['8_1'] = keras.Sequential([
layer_feature_input,
layer_feature_normalizer,
*layer_8_hidden_relu_list,
layer_8_output])
all_models['8_1'].summary()
Model: "sequential_4"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type)                    ┃ Output Shape           ┃       Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ normalization (Normalization)   │ (None, 10)             │            21 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_13 (Dense)                │ (None, 8)              │            88 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_14 (Dense)                │ (None, 1)              │             9 │
└─────────────────────────────────┴────────────────────────┴───────────────┘
Total params: 118 (476.00 B)
Trainable params: 97 (388.00 B)
Non-trainable params: 21 (88.00 B)
CPU times: total: 0 ns Wall time: 12.5 ms
Configure the (8-1) Neural Network Model¶
%%time
all_models['8_1'].compile(**common_compile_options())
deep_8_checkpoint = keras.callbacks.ModelCheckpoint(
MODEL_CHECKPOINT_FILE.replace('.weights.h5', '_8.weights.h5'),
**common_checkpoint_options
)
CPU times: total: 0 ns Wall time: 2 ms
Train the (8-1) Neural Network Model¶
%%time
deep_8_history = all_models['8_1'].fit(
**common_fit_options,
callbacks=[deep_8_checkpoint]
)
all_models['8_1'].load_weights(MODEL_CHECKPOINT_FILE.replace('.weights.h5', '_8.weights.h5'))
CPU times: total: 2.45 s Wall time: 7.92 s
Score the (8-1) Neural Network Model¶
%%time
deep_8_preds = all_models['8_1'].predict(X_test).flatten()
deep_8_scores_df = score_model(deep_8_preds, np.array(y_test), index='8_1')
# Add it to the leaderboard
leaderboard_df = score_combine(leaderboard_df, deep_8_scores_df)
leaderboard_df[:]
24/24 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step CPU times: total: 15.6 ms Wall time: 79.5 ms
| | mean_squared_error | mean_absolute_error | explained_variance_score | r2_score |
|---|---|---|---|---|
| untrained_linear | 101.943787 | 9.748023 | 0.049686 | -13.000124 |
| linear | 13.669806 | 3.097228 | -0.187996 | -2.867148 |
| 64_32_16_8_1 | 3.746227 | 1.420128 | 0.202596 | 0.202460 |
| 32_16_8_1 | 3.892990 | 1.436031 | 0.174743 | 0.174401 |
| 16_8_1 | 3.791295 | 1.422898 | 0.166290 | 0.165454 |
| 8_1 | 3.994874 | 1.469987 | 0.151188 | 0.151183 |
Neural Network Model (4-1)¶
Still not too shabby. Let's reduce the last hidden layer to 4 neurons.
%%time
# layer: input - reused from linear model
# layer: normalizer - reused from linear model
# layer(s): hidden (relu) - 4
num_hidden_layers = 1
num_units = 4
layer_4_hidden_relu_list = generate_neural_pyramid(num_hidden_layers, num_units)
# layer: output (linear regression)
layer_4_output = keras.layers.Dense(units=1)
# architecture:
# input -> normalizer -> hidden(s) -> dense
all_models['4_1'] = keras.Sequential([
layer_feature_input,
layer_feature_normalizer,
*layer_4_hidden_relu_list,
layer_4_output])
all_models['4_1'].summary()
Model: "sequential_5"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type)                    ┃ Output Shape           ┃       Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ normalization (Normalization)   │ (None, 10)             │            21 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_15 (Dense)                │ (None, 4)              │            44 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_16 (Dense)                │ (None, 1)              │             5 │
└─────────────────────────────────┴────────────────────────┴───────────────┘
Total params: 70 (284.00 B)
Trainable params: 49 (196.00 B)
Non-trainable params: 21 (88.00 B)
CPU times: total: 0 ns Wall time: 14 ms
Configure the (4-1) Neural Network Model¶
%%time
all_models['4_1'].compile(**common_compile_options())
deep_4_checkpoint = keras.callbacks.ModelCheckpoint(
MODEL_CHECKPOINT_FILE.replace('.weights.h5', '_4.weights.h5'),
**common_checkpoint_options
)
CPU times: total: 0 ns Wall time: 0 ns
Train the (4-1) Neural Network Model¶
%%time
deep_4_history = all_models['4_1'].fit(
**common_fit_options,
callbacks=[deep_4_checkpoint]
)
all_models['4_1'].load_weights(MODEL_CHECKPOINT_FILE.replace('.weights.h5', '_4.weights.h5'))
CPU times: total: 3.09 s Wall time: 7.97 s
Score the (4-1) Neural Network Model¶
%%time
deep_4_preds = all_models['4_1'].predict(X_test).flatten()
deep_4_scores_df = score_model(deep_4_preds, np.array(y_test), index='4_1')
# Add it to the leaderboard
leaderboard_df = score_combine(leaderboard_df, deep_4_scores_df)
leaderboard_df[:]
24/24 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step CPU times: total: 15.6 ms Wall time: 82.3 ms
| | mean_squared_error | mean_absolute_error | explained_variance_score | r2_score |
|---|---|---|---|---|
| untrained_linear | 101.943787 | 9.748023 | 0.049686 | -13.000124 |
| linear | 13.669806 | 3.097228 | -0.187996 | -2.867148 |
| 64_32_16_8_1 | 3.746227 | 1.420128 | 0.202596 | 0.202460 |
| 32_16_8_1 | 3.892990 | 1.436031 | 0.174743 | 0.174401 |
| 16_8_1 | 3.791295 | 1.422898 | 0.166290 | 0.165454 |
| 8_1 | 3.994874 | 1.469987 | 0.151188 | 0.151183 |
| 4_1 | 7.346774 | 2.011863 | 0.148698 | 0.058166 |
Neural Network Model (2-1)¶
The (4-1) model took a noticeable hit. Let's shrink to a single 2-unit hidden layer and find the floor.
%%time
# layer: input - reused from linear model
# layer: normalizer - reused from linear model
# layer(s): hidden (relu) - 2
num_hidden_layers = 1
num_units = 2
layer_2_hidden_relu_list = generate_neural_pyramid(num_hidden_layers, num_units)
# layer: output (linear regression)
layer_2_output = keras.layers.Dense(units=1)
# architecture:
# input -> normalizer -> hidden(s) -> dense
all_models['2_1'] = keras.Sequential([
layer_feature_input,
layer_feature_normalizer,
*layer_2_hidden_relu_list,
layer_2_output])
all_models['2_1'].summary()
Model: "sequential_6"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type)                    ┃ Output Shape           ┃       Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ normalization (Normalization)   │ (None, 10)             │            21 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_17 (Dense)                │ (None, 2)              │            22 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_18 (Dense)                │ (None, 1)              │             3 │
└─────────────────────────────────┴────────────────────────┴───────────────┘
Total params: 46 (188.00 B)
Trainable params: 25 (100.00 B)
Non-trainable params: 21 (88.00 B)
CPU times: total: 0 ns Wall time: 15 ms
Configure the (2-1) Neural Network Model¶
%%time
all_models['2_1'].compile(**common_compile_options())
deep_2_checkpoint = keras.callbacks.ModelCheckpoint(
MODEL_CHECKPOINT_FILE.replace('.weights.h5', '_2.weights.h5'),
**common_checkpoint_options
)
CPU times: total: 0 ns Wall time: 1 ms
Train the (2-1) Neural Network Model¶
%%time
deep_2_history = all_models['2_1'].fit(
**common_fit_options,
callbacks=[deep_2_checkpoint]
)
all_models['2_1'].load_weights(MODEL_CHECKPOINT_FILE.replace('.weights.h5', '_2.weights.h5'))
CPU times: total: 2.5 s Wall time: 8 s
Score the (2-1) Neural Network Model¶
%%time
deep_2_preds = all_models['2_1'].predict(X_test).flatten()
deep_2_scores_df = score_model(deep_2_preds, np.array(y_test), index='2_1')
# Add it to the leaderboard
leaderboard_df = score_combine(leaderboard_df, deep_2_scores_df)
leaderboard_df[:]
24/24 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step CPU times: total: 15.6 ms Wall time: 81.2 ms
| | mean_squared_error | mean_absolute_error | explained_variance_score | r2_score |
|---|---|---|---|---|
| untrained_linear | 101.943787 | 9.748023 | 0.049686 | -13.000124 |
| linear | 13.669806 | 3.097228 | -0.187996 | -2.867148 |
| 64_32_16_8_1 | 3.746227 | 1.420128 | 0.202596 | 0.202460 |
| 32_16_8_1 | 3.892990 | 1.436031 | 0.174743 | 0.174401 |
| 16_8_1 | 3.791295 | 1.422898 | 0.166290 | 0.165454 |
| 8_1 | 3.994874 | 1.469987 | 0.151188 | 0.151183 |
| 4_1 | 7.346774 | 2.011863 | 0.148698 | 0.058166 |
| 2_1 | 7.742756 | 2.080535 | 0.144014 | 0.050772 |
We're clearly showing signs of degradation in the (4-1) and (2-1) models. Let's see how they all compare.
True vs Predicted Age Scatter Plots¶
This gives us a good view of how well the model is predicting the age of the crabs.
%%time
all_preds = {
'untrained_linear': {'true': y_test, 'pred': untrained_linear_preds},
'linear': {'true': y_test, 'pred': linear_preds},
'64_32_16_8_1': {'true': y_test, 'pred': deepest_preds},
'32_16_8_1': {'true': y_test, 'pred': deep_32_16_8_preds},
'16_8_1': {'true': y_test, 'pred': deep_16_8_preds},
'8_1': {'true': y_test, 'pred': deep_8_preds},
'4_1': {'true': y_test, 'pred': deep_4_preds},
'2_1': {'true': y_test, 'pred': deep_2_preds}
}
plot_true_vs_pred_from_dict(all_preds, show_target_line=True)
CPU times: total: 31.2 ms Wall time: 53.1 ms
True vs Predicted Age Scatter Plot Observations¶
Neat!
*Note: The line of truth is shown in green.*
Untrained Linear Model¶
- Very bad.
- As usual.
Linear Model¶
- Guesses are lower than the actual crab ages.
- Older crabs may not be harvested soon enough.
Neural Network Model (64-32-16-8-1)¶
Neural Network Model (32-16-8-1)¶
Neural Network Model (16-8-1)¶
Neural Network Model (8-1)¶
- All looking good.
- Some middle-aged crabs are guessed to be older, but this makes sense since crabs stop growing as much after a certain age.
Neural Network Model (4-1)¶
- Something strange going on here.
- This model is predicting a disproportionate amount of crabs are 5 years old.
Neural Network Model (2-1)¶
- Visually similar to the other neural network models.
- The scores show it is making predictions further from the truth.
Training Loss Over Time Plots¶
Now we'll plot the training loss over time. This gives us insight into how quickly each model learns, and it can also reveal overfitting: training loss should decrease over time, but if the validation loss starts to increase while the training loss keeps falling, the model is overfitting.
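This is exactly the signal our `ModelCheckpoint` callbacks (with `monitor='val_loss'`) exploit: they keep the weights from the epoch where validation loss bottomed out. A minimal sketch of finding that epoch from a Keras `history.history['val_loss']` list:

```python
def best_epoch(val_losses):
    """Index of the lowest validation loss; training past this point
    is where overfitting sets in."""
    return min(range(len(val_losses)), key=val_losses.__getitem__)

# Toy history: validation loss bottoms out at epoch 3, then creeps up.
val_loss = [9.0, 5.0, 3.5, 3.1, 3.3, 3.8, 4.4]
print(best_epoch(val_loss))  # 3
```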
%%time
all_histories = {
'linear': linear_history,
'64_32_16_8_1': deepest_history,
'32_16_8_1': deep_32_16_8_history,
'16_8_1': deep_16_8_history,
'8_1': deep_8_history,
'4_1': deep_4_history,
'2_1': deep_2_history
}
plot_training_loss_from_dict(all_histories)
CPU times: total: 31.2 ms Wall time: 53.1 ms
Training Loss Over Time Observations¶
Pretty cool, huh?
*Note: These models have some overhead involved in training, so it's not as simple as "more neurons = better". Sometimes a simple ML algorithm can do the trick in milliseconds.*
Linear Model¶
- Never even showed up to the party.
- Exceeds a Mean Squared Error of 10.
Neural Network Model (64-32-16-8-1)¶
- Clearly overfitting already.
- Gets the gist quickly.
Neural Network Model (32-16-8-1)¶
- Looking good.
- Also gets to the gist quickly.
Neural Network Model (16-8-1)¶
- Similar to the (32-16-8-1) model.
- Less variance in the training loss.
Neural Network Model (8-1)¶
- The curve is smoothing out.
Neural Network Model (4-1)¶
- Not as quick to converge.
Neural Network Model (2-1)¶
- Lagging behind.
- Perhaps more epochs will give this model a chance.
Re-Train the Models Again¶
Let's start over, but train for longer this time.
Give each model 5x as many epochs.
Linear Model¶
%%time
# add more epochs
# common_fit_options = {
# 'x': X_train,
# 'y': y_train,
# 'epochs': NUM_EPOCHS*5,
# 'verbose': 0,
# 'validation_split': VALIDATION_SPLIT
# }
common_fit_options['epochs'] = NUM_EPOCHS*5 # give them 5x as many epochs
# reset the linear model
all_models['linear'] = keras.models.clone_model(all_models['linear'])
all_models['linear'].compile(**common_compile_options())
all_histories.update({'linear': all_models['linear'].fit(
**common_fit_options,
callbacks=[linear_checkpoint])})
all_models['linear'].load_weights(MODEL_CHECKPOINT_FILE.replace('.weights.h5', '_linear.weights.h5'))
plot_training_loss(all_histories['linear'], 'Linear Model')
CPU times: total: 7.52 s Wall time: 35 s
Neural Network Model (64-32-16-8-1)¶
%%time
# reset the (64-32-16-8-1) model
all_models['64_32_16_8_1'] = keras.models.clone_model(all_models['64_32_16_8_1'])
all_models['64_32_16_8_1'].compile(**common_compile_options())
all_histories.update({'64_32_16_8_1':
all_models['64_32_16_8_1'].fit(
**common_fit_options,
callbacks=[deepest_checkpoint])})
all_models['64_32_16_8_1'].load_weights(MODEL_CHECKPOINT_FILE.replace('.weights.h5', '_64_32_16_8_1.weights.h5'))
plot_training_loss(all_histories['64_32_16_8_1'], '64-32-16-8-1 NN Model')
CPU times: total: 8.88 s Wall time: 40.4 s
(64-32-16-8-1) is definitely overfitting. Let's try the next one.
Neural Network Model (32-16-8-1)¶
%%time
# reset the (32-16-8-1) model
all_models['32_16_8_1'] = keras.models.clone_model(all_models['32_16_8_1'])
all_models['32_16_8_1'].compile(**common_compile_options())
all_histories.update({'32_16_8_1':
all_models['32_16_8_1'].fit(
**common_fit_options,
callbacks=[deep_32_16_8_checkpoint])})
all_models['32_16_8_1'].load_weights(MODEL_CHECKPOINT_FILE.replace('.weights.h5', '_32_16_8_1.weights.h5'))
plot_training_loss(all_histories['32_16_8_1'], '32-16-8-1 NN Model')
CPU times: total: 9.72 s Wall time: 38.6 s
(32-16-8-1) is still overfitting. Let's keep going.
Neural Network Model (16-8-1)¶
%%time
# reset the (16-8-1) model
all_models['16_8_1'] = keras.models.clone_model(all_models['16_8_1'])
all_models['16_8_1'].compile(**common_compile_options())
all_histories.update({'16_8_1':
all_models['16_8_1'].fit(
**common_fit_options,
callbacks=[deep_16_8_checkpoint])})
all_models['16_8_1'].load_weights(MODEL_CHECKPOINT_FILE.replace('.weights.h5', '_16_8.weights.h5'))
plot_training_loss(all_histories['16_8_1'], '16-8-1 NN Model')
CPU times: total: 7.83 s Wall time: 37.4 s
Validation loss is remaining steady, and the training loss is decreasing ever so slightly. It might be overfitting, but it's hard to tell.
Neural Network Model (8-1)¶
%%time
# reset the (8-1) model
all_models['8_1'] = keras.models.clone_model(all_models['8_1'])
all_models['8_1'].compile(**common_compile_options())
all_histories.update({'8_1':
all_models['8_1'].fit(
**common_fit_options,
callbacks=[deep_8_checkpoint])})
all_models['8_1'].load_weights(MODEL_CHECKPOINT_FILE.replace('.weights.h5', '_8.weights.h5'))
plot_training_loss(all_histories['8_1'], '8-1 NN Model')
CPU times: total: 8.72 s Wall time: 37.2 s
(8-1) doesn't seem to be overfitting. Let's keep it in mind.
Neural Network Model (4-1)¶
%%time
# reset the (4-1) model
all_models['4_1'] = keras.models.clone_model(all_models['4_1'])
all_models['4_1'].compile(**common_compile_options())
all_histories.update({'4_1':
all_models['4_1'].fit(
**common_fit_options,
callbacks=[deep_4_checkpoint])})
all_models['4_1'].load_weights(MODEL_CHECKPOINT_FILE.replace('.weights.h5', '_4.weights.h5'))
plot_training_loss(all_histories['4_1'], '4-1 NN Model')
CPU times: total: 7.02 s Wall time: 36.7 s
(4-1) looks pretty good!
Neural Network Model (2-1)¶
%%time
# reset the (2-1) model
all_models['2_1'] = keras.models.clone_model(all_models['2_1'])
all_models['2_1'].compile(**common_compile_options())
all_histories.update({'2_1':
all_models['2_1'].fit(
**common_fit_options,
callbacks=[deep_2_checkpoint])})
all_models['2_1'].load_weights(MODEL_CHECKPOINT_FILE.replace('.weights.h5', '_2.weights.h5'))
plot_training_loss(all_histories['2_1'], '2-1 NN Model')
CPU times: total: 8.72 s Wall time: 36.7 s
Re-Plot Training Loss Over Time¶
Over 500 epochs, we can see how the models are learning.
%%time
plot_training_loss_from_dict(all_histories)
CPU times: total: 31.2 ms Wall time: 50.6 ms
Training Loss Over More Time Observations¶
Cool stuff!
Linear Model¶
- Finally showed up to the party.
- Converges to an MSE of ~4, right alongside the (2-1) model.
Neural Network Model (64-32-16-8-1)¶
- Obviously overfitting.
Neural Network Model (32-16-8-1)¶
- Also overfitting.
Neural Network Model (16-8-1)¶
Neural Network Model (8-1)¶
- Similar to the more complex neural networks.
- Less variance as the number of neurons decreases.
Neural Network Model (4-1)¶
- After a bumpy start, it got the hang of it.
- Less variance in the training and validation loss.
Neural Network Model (2-1)¶
- It never caught up.
- But it's not overfitting so much, so that's good.
***Note**: Implementing Early Stopping on these models resulted in early terminations in most cases.*
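On that note: Keras's `keras.callbacks.EarlyStopping(monitor='val_loss', patience=...)` applies roughly the patience rule below. A pure-Python sketch of that rule (no Keras required) shows why a noisy early curve, like the (2-1) model's, can trigger a premature stop:

```python
def early_stop_epoch(val_losses, patience=10, min_delta=0.0):
    """Return the epoch training would stop at, or None if it runs to the end.

    Stops once `patience` consecutive epochs pass without val_loss
    improving on the best value seen so far by more than `min_delta`.
    """
    best = float('inf')
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            best = loss   # improvement: remember it and reset the counter
            wait = 0
        else:
            wait += 1     # no improvement this epoch
            if wait >= patience:
                return epoch
    return None

# a bumpy plateau after epoch 2 trips a small patience
print(early_stop_epoch([5, 4, 3, 3.1, 3.2, 3.05, 3.3], patience=3))
```

A larger `patience` (or a `min_delta` tuned to the loss scale) gives a slow learner like (2-1) more room before the rule fires.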
Model Leaderboard¶
%%time
# score each model
all_models = {
'linear': all_models['linear'],
'64_32_16_8_1': all_models['64_32_16_8_1'],
'32_16_8_1': all_models['32_16_8_1'],
'16_8_1': all_models['16_8_1'],
'8_1': all_models['8_1'],
'4_1': all_models['4_1'],
'2_1': all_models['2_1']
}
# score on the test set
for model_name, model in all_models.items():
preds = model.predict(X_test).flatten()
scores_df = score_model(preds, np.array(y_test), index=model_name)
leaderboard_df = score_combine(leaderboard_df, scores_df)
# copy untrained linear model scores - it doesn't get retrained here for time's sake
training_leaderboard_df = leaderboard_df.loc[['untrained_linear']]
# score on the training set
for model_name, model in all_models.items():
preds = model.predict(X_train).flatten()
scores_df = score_model(preds, np.array(y_train), index=model_name+'_train')
training_leaderboard_df = score_combine(training_leaderboard_df, scores_df)
leaderboard_df[:]
24/24 ━━━━━━━━━━━━━━━━━━━━ 0s 914us/step 24/24 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step 24/24 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step 24/24 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step 24/24 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step 24/24 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step 24/24 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step 95/95 ━━━━━━━━━━━━━━━━━━━━ 0s 402us/step 95/95 ━━━━━━━━━━━━━━━━━━━━ 0s 460us/step 95/95 ━━━━━━━━━━━━━━━━━━━━ 0s 484us/step 95/95 ━━━━━━━━━━━━━━━━━━━━ 0s 478us/step 95/95 ━━━━━━━━━━━━━━━━━━━━ 0s 466us/step 95/95 ━━━━━━━━━━━━━━━━━━━━ 0s 456us/step 95/95 ━━━━━━━━━━━━━━━━━━━━ 0s 463us/step CPU times: total: 453 ms Wall time: 1.34 s
| mean_squared_error | mean_absolute_error | explained_variance_score | r2_score | |
|---|---|---|---|---|
| untrained_linear | 101.943787 | 9.748023 | 0.049686 | -13.000124 |
| linear | 3.997827 | 1.473956 | 0.011232 | 0.010781 |
| 64_32_16_8_1 | 3.630893 | 1.404292 | 0.302562 | 0.302399 |
| 32_16_8_1 | 3.602257 | 1.385743 | 0.338213 | 0.337592 |
| 16_8_1 | 3.807182 | 1.415440 | 0.280600 | 0.279393 |
| 8_1 | 3.794136 | 1.432214 | 0.228980 | 0.228786 |
| 4_1 | 3.901053 | 1.461622 | 0.178054 | 0.177953 |
| 2_1 | 3.946111 | 1.468480 | 0.044709 | 0.044348 |
Test Set Leaderboard Observations¶
Everyone but the untrained linear model did pretty well. Let's see how they did on the training set.
%%time
training_leaderboard_df[:]
CPU times: total: 0 ns Wall time: 0 ns
| mean_squared_error | mean_absolute_error | explained_variance_score | r2_score | |
|---|---|---|---|---|
| untrained_linear | 101.943787 | 9.748023 | 0.049686 | -13.000124 |
| linear_train | 3.958444 | 1.475843 | 0.047926 | 0.047916 |
| 64_32_16_8_1_train | 3.003210 | 1.266670 | 0.452759 | 0.452592 |
| 32_16_8_1_train | 3.251329 | 1.309139 | 0.419088 | 0.418564 |
| 16_8_1_train | 3.341644 | 1.325369 | 0.342692 | 0.342082 |
| 8_1_train | 3.571968 | 1.389613 | 0.268823 | 0.268768 |
| 4_1_train | 3.719555 | 1.432148 | 0.233845 | 0.233813 |
| 2_1_train | 3.923570 | 1.470982 | 0.080701 | 0.080701 |
Training Set Leaderboard Observations¶
Everyone did better, as expected. Hopefully, they didn't do too much better. That would signal overfitting.
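One way to put a number on "too much better" is the ratio of test-set MSE to training-set MSE: a ratio near 1 suggests little overfitting. A quick sketch using three models' values copied from the leaderboards above:

```python
# test-set vs training-set MSE, copied from the leaderboards above
test_mse = {'64_32_16_8_1': 3.630893, '8_1': 3.794136, '2_1': 3.946111}
train_mse = {'64_32_16_8_1': 3.003210, '8_1': 3.571968, '2_1': 3.923570}

for name in test_mse:
    ratio = test_mse[name] / train_mse[name]
    # ~1.0 means similar behavior on both sets; well above 1.0 hints at overfitting
    print(f'{name}: test/train MSE ratio = {ratio:.3f}')
```

The deepest model's ratio is noticeably larger than the shallow models', matching the overfitting signal in the loss plots.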
Putting it All Together¶
%%time
combined_leaderboard_df = score_combine(leaderboard_df, training_leaderboard_df).sort_index()
combined_leaderboard_df[:]
CPU times: total: 0 ns Wall time: 1e+03 µs
| mean_squared_error | mean_absolute_error | explained_variance_score | r2_score | |
|---|---|---|---|---|
| 16_8_1 | 3.807182 | 1.415440 | 0.280600 | 0.279393 |
| 16_8_1_train | 3.341644 | 1.325369 | 0.342692 | 0.342082 |
| 2_1 | 3.946111 | 1.468480 | 0.044709 | 0.044348 |
| 2_1_train | 3.923570 | 1.470982 | 0.080701 | 0.080701 |
| 32_16_8_1 | 3.602257 | 1.385743 | 0.338213 | 0.337592 |
| 32_16_8_1_train | 3.251329 | 1.309139 | 0.419088 | 0.418564 |
| 4_1 | 3.901053 | 1.461622 | 0.178054 | 0.177953 |
| 4_1_train | 3.719555 | 1.432148 | 0.233845 | 0.233813 |
| 64_32_16_8_1 | 3.630893 | 1.404292 | 0.302562 | 0.302399 |
| 64_32_16_8_1_train | 3.003210 | 1.266670 | 0.452759 | 0.452592 |
| 8_1 | 3.794136 | 1.432214 | 0.228980 | 0.228786 |
| 8_1_train | 3.571968 | 1.389613 | 0.268823 | 0.268768 |
| linear | 3.997827 | 1.473956 | 0.011232 | 0.010781 |
| linear_train | 3.958444 | 1.475843 | 0.047926 | 0.047916 |
| untrained_linear | 101.943787 | 9.748023 | 0.049686 | -13.000124 |
R2 Score¶
%%time
clarified_leaderboard_df = leaderboard_df.drop('untrained_linear')[['r2_score', 'explained_variance_score']]
clarified_leaderboard_df.plot(kind='bar', title='Feature-Rich vs Deep Learning Model R2 Scores', figsize=(20, 10))
CPU times: total: 0 ns Wall time: 21.6 ms
<Axes: title={'center': 'Feature-Rich vs Deep Learning Model R2 Scores'}>
Mean Squared Error¶
%%time
clarified_leaderboard_df = leaderboard_df.drop('untrained_linear')[['mean_squared_error', 'mean_absolute_error']]
clarified_leaderboard_df.plot(kind='bar', title='Feature-Rich vs Deep Learning Model MSE Scores', figsize=(20, 10))
CPU times: total: 0 ns Wall time: 21 ms
<Axes: title={'center': 'Feature-Rich vs Deep Learning Model MSE Scores'}>
Score Comparison Observations¶
Neural Network Model (64-32-16-8-1)¶
(64-32-16-8-1) is definitely overfitting.
Neural Network Model (32-16-8-1)¶
(32-16-8-1) is overfitting.
Neural Network Model (16-8-1)¶
(16-8-1) is overfitting.
Neural Network Model (8-1)¶
(8-1) is not overfitting too much.
Neural Network Model (4-1)¶
(4-1) is not overfitting too much either.
Neural Network Model (2-1)¶
(2-1) is not overfitting too much either.
Show the Leaderboard Again¶
%%time
leaderboard_df[:]
CPU times: total: 0 ns Wall time: 0 ns
| mean_squared_error | mean_absolute_error | explained_variance_score | r2_score | |
|---|---|---|---|---|
| untrained_linear | 101.943787 | 9.748023 | 0.049686 | -13.000124 |
| linear | 3.997827 | 1.473956 | 0.011232 | 0.010781 |
| 64_32_16_8_1 | 3.630893 | 1.404292 | 0.302562 | 0.302399 |
| 32_16_8_1 | 3.602257 | 1.385743 | 0.338213 | 0.337592 |
| 16_8_1 | 3.807182 | 1.415440 | 0.280600 | 0.279393 |
| 8_1 | 3.794136 | 1.432214 | 0.228980 | 0.228786 |
| 4_1 | 3.901053 | 1.461622 | 0.178054 | 0.177953 |
| 2_1 | 3.946111 | 1.468480 | 0.044709 | 0.044348 |
On Training Data¶
Hopefully they did not do much better than their test counterparts.
%%time
training_leaderboard_df[:]
CPU times: total: 0 ns Wall time: 0 ns
| mean_squared_error | mean_absolute_error | explained_variance_score | r2_score | |
|---|---|---|---|---|
| untrained_linear | 101.943787 | 9.748023 | 0.049686 | -13.000124 |
| linear_train | 3.958444 | 1.475843 | 0.047926 | 0.047916 |
| 64_32_16_8_1_train | 3.003210 | 1.266670 | 0.452759 | 0.452592 |
| 32_16_8_1_train | 3.251329 | 1.309139 | 0.419088 | 0.418564 |
| 16_8_1_train | 3.341644 | 1.325369 | 0.342692 | 0.342082 |
| 8_1_train | 3.571968 | 1.389613 | 0.268823 | 0.268768 |
| 4_1_train | 3.719555 | 1.432148 | 0.233845 | 0.233813 |
| 2_1_train | 3.923570 | 1.470982 | 0.080701 | 0.080701 |
Score These Scores¶
Why not?
These scores measure how similar each model's test-set scores are to its training-set scores.
This can be a good way to see if the model is overfitting or underfitting.
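As a sanity check on the idea, here's a minimal pure-Python sketch (no helper functions): take a model's four test-set scores and four training-set scores as two small vectors and compute their mean squared difference, using the linear model's rows copied from the leaderboards above.

```python
def score_similarity(test_scores, train_scores):
    """Mean squared difference between two score vectors.

    A value near zero means the model behaves almost the same
    on test and training data, i.e. little overfitting.
    """
    diffs = [(a - b) ** 2 for a, b in zip(test_scores, train_scores)]
    return sum(diffs) / len(diffs)

# the linear model's [MSE, MAE, explained variance, R2] rows from above
linear_test = [3.997827, 1.473956, 0.011232, 0.010781]
linear_train = [3.958444, 1.475843, 0.047926, 0.047916]
print(round(score_similarity(linear_test, linear_train), 6))
```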
%%time
score_score_leaderboard_df = pd.DataFrame()
for model_name in leaderboard_df.index:
if model_name == 'untrained_linear':
continue
score_score_leaderboard_df = score_combine(
score_score_leaderboard_df,
score_model(
leaderboard_df.loc[[model_name]].transpose(),
training_leaderboard_df.loc[[f'{model_name}_train']].transpose(), index=model_name
)
)
score_score_leaderboard_df[:]
CPU times: total: 0 ns Wall time: 39.1 ms
| mean_squared_error | mean_absolute_error | explained_variance_score | r2_score | |
|---|---|---|---|---|
| linear | 0.001070 | 0.028775 | 0.999628 | 0.999597 |
| 64_32_16_8_1 | 0.114511 | 0.266424 | 0.945298 | 0.937982 |
| 32_16_8_1 | 0.035529 | 0.147345 | 0.982482 | 0.979998 |
| 16_8_1 | 0.058156 | 0.170098 | 0.977551 | 0.971957 |
| 8_1 | 0.013590 | 0.086148 | 0.994594 | 0.993585 |
| 4_1 | 0.010011 | 0.080656 | 0.995934 | 0.995667 |
| 2_1 | 0.000783 | 0.024347 | 0.999759 | 0.999692 |
Choose the Best Architecture for the Job¶
Those pesky crabs don't want us to know how old they are. We'll find out soon enough.
First, let's choose the architecture to tune.
My Criteria¶
- Mean Absolute Error within 2 years.
- Reasonable Explained Variance Score
- Reasonable R2 Score
- Avoid Overfitting
- Reasonable Learning Rate
Balancing low MSE against overfitting risk, with reasonable R2 and Explained Variance, my choice is the (8-1) neural network architecture.
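These criteria can be sketched as a programmatic filter. The thresholds below (R2 ≥ 0.2, train/test MSE ratio ≤ 1.1) are my own reading of "reasonable" and "avoid overfitting", not values from the notebook; the scores are copied from the test-set leaderboard above, with ratios computed from the two leaderboards.

```python
# test-set scores from the leaderboard above; ratios computed from the two tables
candidates = {
    '8_1': {'mae': 1.432214, 'r2': 0.228786, 'test_train_mse_ratio': 1.062},
    '4_1': {'mae': 1.461622, 'r2': 0.177953, 'test_train_mse_ratio': 1.049},
    '64_32_16_8_1': {'mae': 1.404292, 'r2': 0.302399, 'test_train_mse_ratio': 1.209},
}

def meets_criteria(s, max_mae=2.0, min_r2=0.2, max_ratio=1.1):
    # MAE within 2 years, a reasonable R2, and not overfitting too much
    return (s['mae'] <= max_mae
            and s['r2'] >= min_r2
            and s['test_train_mse_ratio'] <= max_ratio)

print([name for name, s in candidates.items() if meets_criteria(s)])
```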
Pursue the (8-1) Neural Network Architecture¶
Let's try some hyperparameter tuning on the (8-1) neural network model.
Why Not the (4-1) Neural Network Architecture?¶
Despite the (4-1) neural network model performing better over 500 epochs, it made some strange predictions after only 100 epochs.
In the interest of time during hyperparameter tuning, we'll stick with the (8-1) neural network model since it performs well and trains to an acceptable level faster.
Hyperparameter Tuning¶
Next, we will tune the hyperparameters of the (8-1) neural network model.
Hyperparameters¶
- Optimizers (adam, nadam, rmsprop, sgd, adagrad, adadelta, adamax)
- Learning rates (0.1, 0.01, 0.001, 0.0001, etc.)
- Loss functions (mean_squared_error, mean_absolute_error, etc.)
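The search space above is just a small grid; a sketch enumerating it with `itertools.product` (each tuple would then be handed to `model.compile`/`model.fit`, and the lists mirror the bullets above):

```python
from itertools import product

optimizers = ['adam', 'nadam', 'rmsprop', 'sgd', 'adagrad', 'adadelta', 'adamax']
learning_rates = [0.1, 0.01, 0.001, 0.0001]
losses = ['mean_squared_error', 'mean_absolute_error']

# every (optimizer, learning rate, loss) combination to try
grid = list(product(optimizers, learning_rates, losses))
print(len(grid))  # 7 optimizers x 4 learning rates x 2 losses = 56 runs
```

At 56 combinations the full grid is already non-trivial at ~8 seconds per training run, which is one reason we tune one hyperparameter at a time below.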
Let's reset the number of epochs to the original value¶
Save some time since not much progress was made with more epochs.
%%time
common_fit_options['epochs'] = NUM_EPOCHS # back to the original number of epochs
CPU times: total: 0 ns Wall time: 0 ns
Optimizer Tuning¶
Next we'll try compiling the (8-1) neural network model with different optimizers to look for any improvements.
We'll try the following optimizers:
- Adam
- Nadam
- RMSprop
- Stochastic Gradient Descent (SGD)
- Adagrad
- Adadelta
- Adamax
Adam Optimizer¶
We have already been using the Adam optimizer, but let's try it again to get a baseline.
Adam is a popular optimizer that combines the best of Adagrad and RMSprop.
Adam optimization is a stochastic gradient descent method that is based on adaptive estimation of first-order and second-order moments. According to Kingma et al., 2014, the method is "computationally efficient, has little memory requirement, invariant to diagonal rescaling of gradients, and is well suited for problems that are large in terms of data/parameters".
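For intuition, here's a single scalar Adam step written out from the update rule in Kingma et al., 2014 (an illustrative sketch, not Keras's implementation):

```python
import math

def adam_step(theta, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter.

    m and v are the running first/second moment estimates;
    t is the 1-based step count used for bias correction.
    """
    m = b1 * m + (1 - b1) * grad           # first-moment (mean) estimate
    v = b2 * v + (1 - b2) * grad ** 2      # second-moment (uncentered variance) estimate
    m_hat = m / (1 - b1 ** t)              # bias-corrected estimates
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

theta, m, v = adam_step(theta=0.0, grad=2.0, m=0.0, v=0.0, t=1)
print(theta)
```

Note how after bias correction the very first step has magnitude roughly `lr` for any nonzero gradient, which is part of why Adam's defaults make early progress at a predictable pace.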
%%time
all_models['8_1_Adam'] = keras.models.clone_model(all_models['8_1'])
compile_options = common_compile_options()
compile_options['optimizer'] = keras.optimizers.Adam()
all_models['8_1_Adam'].compile(**compile_options)
deep_8_Adam_checkpoint = keras.callbacks.ModelCheckpoint(
MODEL_CHECKPOINT_FILE.replace('.weights.h5', '_8_Adam.weights.h5'),
**common_checkpoint_options)
# initialize history dictionary
optimizer_histories = {'8_1_Adam': \
all_models['8_1_Adam'].fit(
**common_fit_options,
callbacks=[deep_8_Adam_checkpoint])}
all_models['8_1_Adam'].load_weights(MODEL_CHECKPOINT_FILE.replace('.weights.h5', '_8_Adam.weights.h5'))
all_models['8_1_Adam'].summary()
Model: "sequential_4"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓ ┃ Layer (type) ┃ Output Shape ┃ Param # ┃ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩ │ normalization (Normalization) │ (None, 10) │ 21 │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ dense_13 (Dense) │ (None, 8) │ 88 │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ dense_14 (Dense) │ (None, 1) │ 9 │ └─────────────────────────────────┴────────────────────────┴───────────────┘
Total params: 314 (1.23 KB)
Trainable params: 97 (388.00 B)
Non-trainable params: 21 (88.00 B)
Optimizer params: 196 (788.00 B)
CPU times: total: 1.95 s Wall time: 8.29 s
Adam Optimizer Training Loss Plot¶
%%time
plot_training_loss(optimizer_histories['8_1_Adam'], '8-1 NN Model (Adam)')
CPU times: total: 0 ns Wall time: 8.51 ms
Adam Optimizer Score¶
%%time
chosen_arch_preds = {} # initialize prediction dictionary
chosen_arch_preds.update({'8_1_Adam': all_models['8_1_Adam'].predict(X_test).flatten()})
deep_model_scores_df = score_model(chosen_arch_preds['8_1_Adam'], np.array(y_test), index='8_1_Adam')
# Add it to the leaderboard
optimizer_leaderboard_df = score_combine(pd.DataFrame(), deep_model_scores_df)
optimizer_leaderboard_df[:]
24/24 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step CPU times: total: 15.6 ms Wall time: 84.1 ms
| mean_squared_error | mean_absolute_error | explained_variance_score | r2_score | |
|---|---|---|---|---|
| 8_1_Adam | 3.874307 | 1.433025 | 0.14167 | 0.139598 |
Nadam Optimizer¶
Nadam is Adam with Nesterov momentum.
%%time
all_models['8_1_Nadam'] = keras.models.clone_model(all_models['8_1'])
compile_options = common_compile_options()
compile_options['optimizer'] = keras.optimizers.Nadam()
all_models['8_1_Nadam'].compile(**compile_options)
deep_8_Nadam_checkpoint = keras.callbacks.ModelCheckpoint(
MODEL_CHECKPOINT_FILE.replace('.weights.h5', '_8_Nadam.weights.h5'),
**common_checkpoint_options)
optimizer_histories['8_1_Nadam'] = \
all_models['8_1_Nadam'].fit(
**common_fit_options,
callbacks=[deep_8_Nadam_checkpoint])
all_models['8_1_Nadam'].load_weights(MODEL_CHECKPOINT_FILE.replace('.weights.h5', '_8_Nadam.weights.h5'))
all_models['8_1_Nadam'].summary()
Model: "sequential_4"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓ ┃ Layer (type) ┃ Output Shape ┃ Param # ┃ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩ │ normalization (Normalization) │ (None, 10) │ 21 │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ dense_13 (Dense) │ (None, 8) │ 88 │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ dense_14 (Dense) │ (None, 1) │ 9 │ └─────────────────────────────────┴────────────────────────┴───────────────┘
Total params: 315 (1.24 KB)
Trainable params: 97 (388.00 B)
Non-trainable params: 21 (88.00 B)
Optimizer params: 197 (792.00 B)
CPU times: total: 2.81 s Wall time: 8.07 s
Nadam Optimizer Training Loss Plot¶
It does seem to converge slightly faster than Adam.
%%time
plot_training_loss(optimizer_histories['8_1_Nadam'], '8-1 NN Model (Nadam)')
CPU times: total: 0 ns Wall time: 8.01 ms
Nadam Optimizer Score¶
%%time
chosen_arch_preds.update({'8_1_Nadam': all_models['8_1_Nadam'].predict(X_test).flatten()})
deep_model_scores_df = score_model(chosen_arch_preds['8_1_Nadam'], np.array(y_test), index='8_1_Nadam')
# Add it to the leaderboard
optimizer_leaderboard_df = score_combine(optimizer_leaderboard_df, deep_model_scores_df)
optimizer_leaderboard_df[:]
24/24 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step CPU times: total: 15.6 ms Wall time: 83.6 ms
| mean_squared_error | mean_absolute_error | explained_variance_score | r2_score | |
|---|---|---|---|---|
| 8_1_Adam | 3.874307 | 1.433025 | 0.141670 | 0.139598 |
| 8_1_Nadam | 3.840628 | 1.450489 | 0.157823 | 0.156640 |
RMSprop Optimizer¶
RMSprop divides the gradient by a running average of its recent magnitude.
%%time
all_models['8_1_RMSprop'] = keras.models.clone_model(all_models['8_1'])
compile_options = common_compile_options()
compile_options['optimizer'] = keras.optimizers.RMSprop()
all_models['8_1_RMSprop'].compile(**compile_options)
deep_8_RMSprop_checkpoint = keras.callbacks.ModelCheckpoint(
MODEL_CHECKPOINT_FILE.replace('.weights.h5', '_8_RMSprop.weights.h5'),
**common_checkpoint_options)
optimizer_histories['8_1_RMSprop'] = \
all_models['8_1_RMSprop'].fit(
**common_fit_options,
callbacks=[deep_8_RMSprop_checkpoint])
all_models['8_1_RMSprop'].load_weights(MODEL_CHECKPOINT_FILE.replace('.weights.h5', '_8_RMSprop.weights.h5'))
all_models['8_1_RMSprop'].summary()
Model: "sequential_4"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓ ┃ Layer (type) ┃ Output Shape ┃ Param # ┃ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩ │ normalization (Normalization) │ (None, 10) │ 21 │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ dense_13 (Dense) │ (None, 8) │ 88 │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ dense_14 (Dense) │ (None, 1) │ 9 │ └─────────────────────────────────┴────────────────────────┴───────────────┘
Total params: 217 (876.00 B)
Trainable params: 97 (388.00 B)
Non-trainable params: 21 (88.00 B)
Optimizer params: 99 (400.00 B)
CPU times: total: 1.69 s Wall time: 8.1 s
RMSprop Optimizer Training Loss Plot¶
%%time
plot_training_loss(optimizer_histories['8_1_RMSprop'], '8-1 NN Model (RMSprop)')
CPU times: total: 15.6 ms Wall time: 9.51 ms
RMSprop Optimizer Score¶
%%time
chosen_arch_preds.update({'8_1_RMSprop': all_models['8_1_RMSprop'].predict(X_test).flatten()})
deep_model_scores_df = score_model(chosen_arch_preds['8_1_RMSprop'], np.array(y_test), index='8_1_RMSprop')
# Add it to the leaderboard
optimizer_leaderboard_df = score_combine(optimizer_leaderboard_df, deep_model_scores_df)
optimizer_leaderboard_df[:]
24/24 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step CPU times: total: 46.9 ms Wall time: 92.6 ms
| mean_squared_error | mean_absolute_error | explained_variance_score | r2_score | |
|---|---|---|---|---|
| 8_1_Adam | 3.874307 | 1.433025 | 0.141670 | 0.139598 |
| 8_1_Nadam | 3.840628 | 1.450489 | 0.157823 | 0.156640 |
| 8_1_RMSprop | 3.843666 | 1.453355 | 0.216134 | 0.215187 |
SGD Optimizer¶
Plain stochastic gradient descent, optionally with momentum.
%%time
all_models['8_1_SGD'] = keras.models.clone_model(all_models['8_1'])
compile_options = common_compile_options()
compile_options['optimizer'] = keras.optimizers.SGD()
all_models['8_1_SGD'].compile(**compile_options)
deep_8_SGD_checkpoint = keras.callbacks.ModelCheckpoint(
MODEL_CHECKPOINT_FILE.replace('.weights.h5', '_8_SGD.weights.h5'),
**common_checkpoint_options)
optimizer_histories['8_1_SGD'] = \
all_models['8_1_SGD'].fit(
**common_fit_options,
callbacks=[deep_8_SGD_checkpoint])
all_models['8_1_SGD'].load_weights(MODEL_CHECKPOINT_FILE.replace('.weights.h5', '_8_SGD.weights.h5'))
all_models['8_1_SGD'].summary()
Model: "sequential_4"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓ ┃ Layer (type) ┃ Output Shape ┃ Param # ┃ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩ │ normalization (Normalization) │ (None, 10) │ 21 │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ dense_13 (Dense) │ (None, 8) │ 88 │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ dense_14 (Dense) │ (None, 1) │ 9 │ └─────────────────────────────────┴────────────────────────┴───────────────┘
Total params: 120 (488.00 B)
Trainable params: 97 (388.00 B)
Non-trainable params: 21 (88.00 B)
Optimizer params: 2 (12.00 B)
CPU times: total: 1.03 s Wall time: 7.81 s
SGD Optimizer Training Loss Plot¶
%%time
plot_training_loss(optimizer_histories['8_1_SGD'], '8-1 NN Model (SGD)')
CPU times: total: 0 ns Wall time: 7.52 ms
SGD Optimizer Score¶
That training loss is crazy. Hopefully the test scores are better.
%%time
chosen_arch_preds.update({'8_1_SGD': all_models['8_1_SGD'].predict(X_test).flatten()})
deep_model_scores_df = score_model(chosen_arch_preds['8_1_SGD'], np.array(y_test), index='8_1_SGD')
# Add it to the leaderboard
optimizer_leaderboard_df = score_combine(optimizer_leaderboard_df, deep_model_scores_df)
optimizer_leaderboard_df[:]
24/24 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step CPU times: total: 15.6 ms Wall time: 88.7 ms
| mean_squared_error | mean_absolute_error | explained_variance_score | r2_score | |
|---|---|---|---|---|
| 8_1_Adam | 3.874307 | 1.433025 | 0.141670 | 0.139598 |
| 8_1_Nadam | 3.840628 | 1.450489 | 0.157823 | 0.156640 |
| 8_1_RMSprop | 3.843666 | 1.453355 | 0.216134 | 0.215187 |
| 8_1_SGD | 4.126909 | 1.485601 | -0.126507 | -0.131760 |
Adagrad Optimizer¶
Adagrad adapts the learning rate per parameter, so the effective rate keeps shrinking over training.
%%time
all_models['8_1_Adagrad'] = keras.models.clone_model(all_models['8_1'])
compile_options = common_compile_options()
compile_options['optimizer'] = keras.optimizers.Adagrad()
all_models['8_1_Adagrad'].compile(**compile_options)
deep_8_Adagrad_checkpoint = keras.callbacks.ModelCheckpoint(
MODEL_CHECKPOINT_FILE.replace('.weights.h5', '_8_Adagrad.weights.h5'),
**common_checkpoint_options)
optimizer_histories['8_1_Adagrad'] = \
all_models['8_1_Adagrad'].fit(
**common_fit_options,
callbacks=[deep_8_Adagrad_checkpoint])
all_models['8_1_Adagrad'].load_weights(MODEL_CHECKPOINT_FILE.replace('.weights.h5', '_8_Adagrad.weights.h5'))
all_models['8_1_Adagrad'].summary()
Model: "sequential_4"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓ ┃ Layer (type) ┃ Output Shape ┃ Param # ┃ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩ │ normalization (Normalization) │ (None, 10) │ 21 │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ dense_13 (Dense) │ (None, 8) │ 88 │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ dense_14 (Dense) │ (None, 1) │ 9 │ └─────────────────────────────────┴────────────────────────┴───────────────┘
Total params: 217 (876.00 B)
Trainable params: 97 (388.00 B)
Non-trainable params: 21 (88.00 B)
Optimizer params: 99 (400.00 B)
CPU times: total: 1.7 s Wall time: 8.42 s
Adagrad Optimizer Training Loss Plot¶
%%time
plot_training_loss(optimizer_histories['8_1_Adagrad'], '8-1 NN Model (Adagrad)')
CPU times: total: 0 ns Wall time: 7.51 ms
Adagrad Optimizer Score¶
Is it just me, or did Adagrad not learn anything yet?
%%time
chosen_arch_preds.update({'8_1_Adagrad': all_models['8_1_Adagrad'].predict(X_test).flatten()})
deep_model_scores_df = score_model(chosen_arch_preds['8_1_Adagrad'], np.array(y_test), index='8_1_Adagrad')
# Add it to the leaderboard
optimizer_leaderboard_df = score_combine(optimizer_leaderboard_df, deep_model_scores_df)
optimizer_leaderboard_df[:]
24/24 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step CPU times: total: 0 ns Wall time: 88.6 ms
| mean_squared_error | mean_absolute_error | explained_variance_score | r2_score | |
|---|---|---|---|---|
| 8_1_Adam | 3.874307 | 1.433025 | 0.141670 | 0.139598 |
| 8_1_Nadam | 3.840628 | 1.450489 | 0.157823 | 0.156640 |
| 8_1_RMSprop | 3.843666 | 1.453355 | 0.216134 | 0.215187 |
| 8_1_SGD | 4.126909 | 1.485601 | -0.126507 | -0.131760 |
| 8_1_Adagrad | 14.886531 | 3.162995 | 0.271905 | 0.184537 |
Adadelta Optimizer¶
Adadelta is a good choice for large datasets.
Adadelta optimization is a stochastic gradient descent method that is based on adaptive learning rate per dimension to address two drawbacks:
- The continual decay of learning rates throughout training.
- The need for a manually selected global learning rate.
If its namesake is any indication, it might not do so well here. We'll see.
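For intuition, a single scalar Adadelta step written out from Zeiler's update rule (an illustrative sketch, not Keras's implementation). With freshly zeroed accumulators the first step is tiny, and Keras additionally exposes a `learning_rate` multiplier for Adadelta whose default in recent versions is small, both of which line up with the flat training loss below.

```python
import math

def adadelta_step(grad, acc_grad, acc_delta, rho=0.95, eps=1e-6):
    """One Adadelta update for a scalar parameter.

    acc_grad and acc_delta are running averages of squared
    gradients and squared updates; no global learning rate needed.
    """
    acc_grad = rho * acc_grad + (1 - rho) * grad ** 2
    # step size adapts per dimension from the two accumulators
    delta = -math.sqrt(acc_delta + eps) / math.sqrt(acc_grad + eps) * grad
    acc_delta = rho * acc_delta + (1 - rho) * delta ** 2
    return delta, acc_grad, acc_delta

delta, acc_grad, acc_delta = adadelta_step(grad=2.0, acc_grad=0.0, acc_delta=0.0)
print(delta)  # a very small first step, even for a sizeable gradient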
%%time
all_models['8_1_Adadelta'] = keras.models.clone_model(all_models['8_1'])
compile_options = common_compile_options()
compile_options['optimizer'] = keras.optimizers.Adadelta()
all_models['8_1_Adadelta'].compile(**compile_options)
deep_8_Adadelta_checkpoint = keras.callbacks.ModelCheckpoint(
MODEL_CHECKPOINT_FILE.replace('.weights.h5', '_8_Adadelta.weights.h5'),
**common_checkpoint_options)
optimizer_histories['8_1_Adadelta'] = \
all_models['8_1_Adadelta'].fit(
**common_fit_options,
callbacks=[deep_8_Adadelta_checkpoint])
all_models['8_1_Adadelta'].load_weights(MODEL_CHECKPOINT_FILE.replace('.weights.h5', '_8_Adadelta.weights.h5'))
all_models['8_1_Adadelta'].summary()
Model: "sequential_4"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓ ┃ Layer (type) ┃ Output Shape ┃ Param # ┃ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩ │ normalization (Normalization) │ (None, 10) │ 21 │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ dense_13 (Dense) │ (None, 8) │ 88 │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ dense_14 (Dense) │ (None, 1) │ 9 │ └─────────────────────────────────┴────────────────────────┴───────────────┘
Total params: 314 (1.23 KB)
Trainable params: 97 (388.00 B)
Non-trainable params: 21 (88.00 B)
Optimizer params: 196 (788.00 B)
CPU times: total: 1.14 s Wall time: 8.23 s
Adadelta Optimizer Training Loss Plot¶
%%time
plot_training_loss(optimizer_histories['8_1_Adadelta'], '8-1 NN Model (Adadelta)')
CPU times: total: 0 ns Wall time: 8.51 ms
Adadelta Optimizer Score¶
It's not looking good for Adadelta based on the training loss plot.
Maybe the scores will redeem it?
%%time
chosen_arch_preds.update({'8_1_Adadelta': all_models['8_1_Adadelta'].predict(X_test).flatten()})
deep_model_scores_df = score_model(chosen_arch_preds['8_1_Adadelta'], np.array(y_test), index='8_1_Adadelta')
# Add it to the leaderboard
optimizer_leaderboard_df = score_combine(optimizer_leaderboard_df, deep_model_scores_df)
optimizer_leaderboard_df[:]
24/24 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step CPU times: total: 31.2 ms Wall time: 89.6 ms
| mean_squared_error | mean_absolute_error | explained_variance_score | r2_score | |
|---|---|---|---|---|
| 8_1_Adam | 3.874307 | 1.433025 | 0.141670 | 0.139598 |
| 8_1_Nadam | 3.840628 | 1.450489 | 0.157823 | 0.156640 |
| 8_1_RMSprop | 3.843666 | 1.453355 | 0.216134 | 0.215187 |
| 8_1_SGD | 4.126909 | 1.485601 | -0.126507 | -0.131760 |
| 8_1_Adagrad | 14.886531 | 3.162995 | 0.271905 | 0.184537 |
| 8_1_Adadelta | 18.399206 | 3.530084 | 0.263624 | 0.094465 |
Adamax Optimizer¶
Adamax is a variant of Adam based on the infinity norm.
%%time
all_models['8_1_Adamax'] = keras.models.clone_model(all_models['8_1'])
compile_options = common_compile_options()
compile_options['optimizer'] = keras.optimizers.Adamax()
all_models['8_1_Adamax'].compile(**compile_options)
deep_8_Adamax_checkpoint = keras.callbacks.ModelCheckpoint(
MODEL_CHECKPOINT_FILE.replace('.weights.h5', '_8_Adamax.weights.h5'),
**common_checkpoint_options)
optimizer_histories['8_1_Adamax'] = \
all_models['8_1_Adamax'].fit(
**common_fit_options,
callbacks=[deep_8_Adamax_checkpoint])
all_models['8_1_Adamax'].load_weights(MODEL_CHECKPOINT_FILE.replace('.weights.h5', '_8_Adamax.weights.h5'))
all_models['8_1_Adamax'].summary()
Model: "sequential_4"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓ ┃ Layer (type) ┃ Output Shape ┃ Param # ┃ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩ │ normalization (Normalization) │ (None, 10) │ 21 │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ dense_13 (Dense) │ (None, 8) │ 88 │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ dense_14 (Dense) │ (None, 1) │ 9 │ └─────────────────────────────────┴────────────────────────┴───────────────┘
Total params: 314 (1.23 KB)
Trainable params: 97 (388.00 B)
Non-trainable params: 21 (88.00 B)
Optimizer params: 196 (788.00 B)
CPU times: total: 1.05 s Wall time: 8.1 s
Adamax Optimizer Training Loss Plot¶
%%time
plot_training_loss(optimizer_histories['8_1_Adamax'], '8-1 NN Model (Adamax)')
CPU times: total: 0 ns Wall time: 7.51 ms
Adamax Optimizer Score¶
It held up pretty well despite our initial doubts.
%%time
chosen_arch_preds.update({'8_1_Adamax': all_models['8_1_Adamax'].predict(X_test).flatten()})
deep_model_scores_df = score_model(chosen_arch_preds['8_1_Adamax'], np.array(y_test), index='8_1_Adamax')
# Add it to the leaderboard
optimizer_leaderboard_df = score_combine(optimizer_leaderboard_df, deep_model_scores_df)
optimizer_leaderboard_df[:]
24/24 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step CPU times: total: 46.9 ms Wall time: 86.1 ms
| mean_squared_error | mean_absolute_error | explained_variance_score | r2_score | |
|---|---|---|---|---|
| 8_1_Adam | 3.874307 | 1.433025 | 0.141670 | 0.139598 |
| 8_1_Nadam | 3.840628 | 1.450489 | 0.157823 | 0.156640 |
| 8_1_RMSprop | 3.843666 | 1.453355 | 0.216134 | 0.215187 |
| 8_1_SGD | 4.126909 | 1.485601 | -0.126507 | -0.131760 |
| 8_1_Adagrad | 14.886531 | 3.162995 | 0.271905 | 0.184537 |
| 8_1_Adadelta | 18.399206 | 3.530084 | 0.263624 | 0.094465 |
| 8_1_Adamax | 3.936407 | 1.455304 | 0.117124 | 0.117120 |
%%time
plot_training_loss_from_dict(optimizer_histories)
CPU times: total: 15.6 ms Wall time: 54.1 ms
Optimizer Leaderboard¶
Based on these training loss plots, I'm leaning towards Adam or Nadam.
Let's compare the scores on the leaderboard again before our final decision.
%%time
optimizer_leaderboard_df[:]
CPU times: total: 0 ns Wall time: 0 ns
| mean_squared_error | mean_absolute_error | explained_variance_score | r2_score | |
|---|---|---|---|---|
| 8_1_Adam | 3.874307 | 1.433025 | 0.141670 | 0.139598 |
| 8_1_Nadam | 3.840628 | 1.450489 | 0.157823 | 0.156640 |
| 8_1_RMSprop | 3.843666 | 1.453355 | 0.216134 | 0.215187 |
| 8_1_SGD | 4.126909 | 1.485601 | -0.126507 | -0.131760 |
| 8_1_Adagrad | 14.886531 | 3.162995 | 0.271905 | 0.184537 |
| 8_1_Adadelta | 18.399206 | 3.530084 | 0.263624 | 0.094465 |
| 8_1_Adamax | 3.936407 | 1.455304 | 0.117124 | 0.117120 |
And the Winner Is...¶
Nadam!¶
Nadam has the best mean squared error, with a mean absolute error close behind Adam's. Its explained variance trails RMSprop's, but a crab's age has a year or two of wiggle room because of how the data is collected.
Let's tune the learning rate for Nadam next. We'll create a function with new compile options going forward.
%%time
def nadam_compile_options(learning_rate:float=0.001, loss_metric='mean_squared_error'):
"""Wrapper for common_compile_options with Nadam optimizer.
:param learning_rate: learning rate for Nadam optimizer
:param loss_metric: loss metric for the model. Default is 'mean_squared_error'.
"""
return common_compile_options(
optimizer=keras.optimizers.Nadam(learning_rate=learning_rate),
loss_metric=loss_metric
)
CPU times: total: 0 ns Wall time: 0 ns
%%time
# cloning from Nadam
all_models['8_1_LR_0_1'] = keras.models.clone_model(all_models['8_1_Nadam'])
all_models['8_1_LR_0_1'].compile(**nadam_compile_options(learning_rate=0.1))
deep_8_1_LR_0_1_checkpoint = keras.callbacks.ModelCheckpoint(
MODEL_CHECKPOINT_FILE.replace('.weights.h5', '_8_1_LR_0_1.weights.h5'),
**common_checkpoint_options
)
# initialize history dictionary
learning_rate_histories = {
'8_1_LR_0_1': all_models['8_1_LR_0_1'].fit(
**common_fit_options,
callbacks=[deep_8_1_LR_0_1_checkpoint])}
all_models['8_1_LR_0_1'].load_weights(MODEL_CHECKPOINT_FILE.replace('.weights.h5', '_8_1_LR_0_1.weights.h5'))
all_models['8_1_LR_0_1'].summary()
Model: "sequential_4"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓ ┃ Layer (type) ┃ Output Shape ┃ Param # ┃ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩ │ normalization (Normalization) │ (None, 10) │ 21 │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ dense_13 (Dense) │ (None, 8) │ 88 │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ dense_14 (Dense) │ (None, 1) │ 9 │ └─────────────────────────────────┴────────────────────────┴───────────────┘
Total params: 315 (1.24 KB)
Trainable params: 97 (388.00 B)
Non-trainable params: 21 (88.00 B)
Optimizer params: 197 (792.00 B)
CPU times: total: 1.03 s Wall time: 8.12 s
Learning Rate = 0.1 Training Loss Plot¶
We're expecting a quick approximation and a lot of variance.
%%time
plot_training_loss(learning_rate_histories['8_1_LR_0_1'], '8-1 NN Model (LR=0.1)')
CPU times: total: 0 ns Wall time: 8.51 ms
Learning Rate = 0.1 Score¶
Yikes! That's a lot of variance. Let's try a slower learning rate next. But first, let's put it on the leaderboard.
%%time
chosen_arch_preds = {'8_1_LR_0_1': all_models['8_1_LR_0_1'].predict(X_test).flatten()}
deep_model_scores_df = score_model(chosen_arch_preds['8_1_LR_0_1'], np.array(y_test), index='8_1_LR_0_1')
# Add it to the leaderboard
learning_rate_leaderboard_df = score_combine(pd.DataFrame(), deep_model_scores_df)
learning_rate_leaderboard_df[:]
24/24 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step CPU times: total: 46.9 ms Wall time: 98.4 ms
| mean_squared_error | mean_absolute_error | explained_variance_score | r2_score | |
|---|---|---|---|---|
| 8_1_LR_0_1 | 3.761772 | 1.446253 | 0.183375 | 0.179870 |
Learning Rate = 0.01 (Less Fast Learning)¶
Still not "slow" learning, but let's decelerate a bit to see if we can address the variance.
%%time
all_models['8_1_LR_0_01'] = keras.models.clone_model(all_models['8_1_Nadam'])
all_models['8_1_LR_0_01'].compile(**nadam_compile_options(learning_rate=0.01))
deep_8_1_LR_0_01_checkpoint = keras.callbacks.ModelCheckpoint(
MODEL_CHECKPOINT_FILE.replace('.weights.h5', '_8_1_LR_0_01.weights.h5'),
**common_checkpoint_options)
learning_rate_histories['8_1_LR_0_01'] = \
all_models['8_1_LR_0_01'].fit(
**common_fit_options,
callbacks=[deep_8_1_LR_0_01_checkpoint])
all_models['8_1_LR_0_01'].load_weights(MODEL_CHECKPOINT_FILE.replace('.weights.h5', '_8_1_LR_0_01.weights.h5'))
all_models['8_1_LR_0_01'].summary()
Model: "sequential_4"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓ ┃ Layer (type) ┃ Output Shape ┃ Param # ┃ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩ │ normalization (Normalization) │ (None, 10) │ 21 │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ dense_13 (Dense) │ (None, 8) │ 88 │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ dense_14 (Dense) │ (None, 1) │ 9 │ └─────────────────────────────────┴────────────────────────┴───────────────┘
Total params: 315 (1.24 KB)
Trainable params: 97 (388.00 B)
Non-trainable params: 21 (88.00 B)
Optimizer params: 197 (792.00 B)
CPU times: total: 859 ms Wall time: 8.03 s
Learning Rate = 0.01 Training Loss Plot¶
We're looking for a smoother curve with less variance.
%%time
plot_training_loss(learning_rate_histories['8_1_LR_0_01'], '8-1 NN Model (LR=0.01)')
CPU times: total: 0 ns Wall time: 9.51 ms
Learning Rate = 0.01 Score¶
That's more like it. Let's put it on the leaderboard.
%%time
chosen_arch_preds = {'8_1_LR_0_01': all_models['8_1_LR_0_01'].predict(X_test).flatten()}
deep_model_scores_df = score_model(chosen_arch_preds['8_1_LR_0_01'], np.array(y_test), index='8_1_LR_0_01')
# Add it to the leaderboard
learning_rate_leaderboard_df = score_combine(learning_rate_leaderboard_df, deep_model_scores_df)
learning_rate_leaderboard_df[:]
24/24 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step CPU times: total: 46.9 ms Wall time: 105 ms
| mean_squared_error | mean_absolute_error | explained_variance_score | r2_score | |
|---|---|---|---|---|
| 8_1_LR_0_1 | 3.761772 | 1.446253 | 0.183375 | 0.179870 |
| 8_1_LR_0_01 | 3.712384 | 1.408267 | 0.238712 | 0.237596 |
Learning Rate = 0.001 (Slow Learning)¶
This is the learning rate we've been using, so we know what to expect.
Let's confirm our expectations and see how it compares to the others.
%%time
all_models['8_1_LR_0_001'] = keras.models.clone_model(all_models['8_1_Nadam'])
all_models['8_1_LR_0_001'].compile(**nadam_compile_options(learning_rate=0.001))
deep_8_1_LR_0_001_checkpoint = keras.callbacks.ModelCheckpoint(
MODEL_CHECKPOINT_FILE.replace('.weights.h5', '_8_1_LR_0_001.weights.h5'),
**common_checkpoint_options)
learning_rate_histories['8_1_LR_0_001'] = \
all_models['8_1_LR_0_001'].fit(
**common_fit_options,
callbacks=[deep_8_1_LR_0_001_checkpoint])
all_models['8_1_LR_0_001'].load_weights(MODEL_CHECKPOINT_FILE.replace('.weights.h5', '_8_1_LR_0_001.weights.h5'))
all_models['8_1_LR_0_001'].summary()
Model: "sequential_4"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓ ┃ Layer (type) ┃ Output Shape ┃ Param # ┃ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩ │ normalization (Normalization) │ (None, 10) │ 21 │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ dense_13 (Dense) │ (None, 8) │ 88 │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ dense_14 (Dense) │ (None, 1) │ 9 │ └─────────────────────────────────┴────────────────────────┴───────────────┘
Total params: 315 (1.24 KB)
Trainable params: 97 (388.00 B)
Non-trainable params: 21 (88.00 B)
Optimizer params: 197 (792.00 B)
CPU times: total: 1.47 s Wall time: 8.34 s
Learning Rate = 0.001 Training Loss Plot¶
This should look familiar.
%%time
plot_training_loss(learning_rate_histories['8_1_LR_0_001'], '8-1 NN Model (LR=0.001)')
CPU times: total: 0 ns Wall time: 7.5 ms
Learning Rate = 0.001 Score¶
Add it to the leaderboard.
%%time
chosen_arch_preds = {'8_1_LR_0_001': all_models['8_1_LR_0_001'].predict(X_test).flatten()}
deep_model_scores_df = score_model(chosen_arch_preds['8_1_LR_0_001'], np.array(y_test), index='8_1_LR_0_001')
# Add it to the leaderboard
learning_rate_leaderboard_df = score_combine(learning_rate_leaderboard_df, deep_model_scores_df)
learning_rate_leaderboard_df[:]
24/24 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step CPU times: total: 78.1 ms Wall time: 93.6 ms
| mean_squared_error | mean_absolute_error | explained_variance_score | r2_score | |
|---|---|---|---|---|
| 8_1_LR_0_1 | 3.761772 | 1.446253 | 0.183375 | 0.179870 |
| 8_1_LR_0_01 | 3.712384 | 1.408267 | 0.238712 | 0.237596 |
| 8_1_LR_0_001 | 4.019154 | 1.480873 | 0.136125 | 0.136110 |
Learning Rate = 0.0001 (Slower Learning)¶
Let's slow down the learning rate even more. It might take a while to converge, but we expect less variance.
%%time
all_models['8_1_LR_0_0001'] = keras.models.clone_model(all_models['8_1_Nadam'])
all_models['8_1_LR_0_0001'].compile(**nadam_compile_options(learning_rate=0.0001))
deep_8_1_LR_0_0001_checkpoint = keras.callbacks.ModelCheckpoint(
MODEL_CHECKPOINT_FILE.replace('.weights.h5', '_8_1_LR_0_0001.weights.h5'),
**common_checkpoint_options)
learning_rate_histories['8_1_LR_0_0001'] = \
all_models['8_1_LR_0_0001'].fit(
**common_fit_options,
callbacks=[deep_8_1_LR_0_0001_checkpoint])
all_models['8_1_LR_0_0001'].load_weights(MODEL_CHECKPOINT_FILE.replace('.weights.h5', '_8_1_LR_0_0001.weights.h5'))
all_models['8_1_LR_0_0001'].summary()
Model: "sequential_4"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓ ┃ Layer (type) ┃ Output Shape ┃ Param # ┃ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩ │ normalization (Normalization) │ (None, 10) │ 21 │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ dense_13 (Dense) │ (None, 8) │ 88 │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ dense_14 (Dense) │ (None, 1) │ 9 │ └─────────────────────────────────┴────────────────────────┴───────────────┘
Total params: 315 (1.24 KB)
Trainable params: 97 (388.00 B)
Non-trainable params: 21 (88.00 B)
Optimizer params: 197 (792.00 B)
CPU times: total: 2.06 s Wall time: 9.04 s
Learning Rate = 0.0001 Training Loss Plot¶
This should be a slow and steady curve.
%%time
plot_training_loss(learning_rate_histories['8_1_LR_0_0001'], '8-1 NN Model (LR=0.0001)')
CPU times: total: 0 ns Wall time: 8.51 ms
Learning Rate = 0.0001 Score¶
This one is acting as expected. In an ideal world, we would give every model more epochs, but for the sake of time, we'll stick to 100 epochs and consider this 'too slow' for this project.
%%time
chosen_arch_preds = {'8_1_LR_0_0001': all_models['8_1_LR_0_0001'].predict(X_test).flatten()}
deep_model_scores_df = score_model(chosen_arch_preds['8_1_LR_0_0001'], np.array(y_test), index='8_1_LR_0_0001')
# Add it to the leaderboard
learning_rate_leaderboard_df = score_combine(learning_rate_leaderboard_df, deep_model_scores_df)
learning_rate_leaderboard_df.sort_index()[:]
24/24 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step CPU times: total: 15.6 ms Wall time: 92.8 ms
| mean_squared_error | mean_absolute_error | explained_variance_score | r2_score | |
|---|---|---|---|---|
| 8_1_LR_0_0001 | 5.629026 | 1.729137 | 0.024360 | 0.004204 |
| 8_1_LR_0_001 | 4.019154 | 1.480873 | 0.136125 | 0.136110 |
| 8_1_LR_0_01 | 3.712384 | 1.408267 | 0.238712 | 0.237596 |
| 8_1_LR_0_1 | 3.761772 | 1.446253 | 0.183375 | 0.179870 |
Scheduled Learning Rate¶
Let's use what we learned from simulated annealing to schedule the learning rate.
A learning rate of 0.01 has the best scores so far.
Our scheduled learning rate can start here and decrease by $X$% every $Y$ epochs of no improvement.
We learned from an earlier experiment that these networks commonly plateau but continue to learn after a while, so we want to give it a chance to learn.
We'll use a ReduceLROnPlateau callback to adjust the learning rate based on the validation loss.
- Factor = 0.75: The factor by which the learning rate will be reduced. new_lr = lr * factor.
- Patience = 9: Number of epochs with no improvement after which learning rate will be reduced.
These values were chosen based on some experimentation (not shown here for brevity).
I wonder if we can schedule the schedule's schedule... (We can, but we won't here.)
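The plateau rule itself is simple enough to sketch in plain Python. This is a simplified model of the callback's behavior (function name and the omission of `min_delta` and cooldown are mine), just to make the Factor/Patience bullets concrete:

```python
def reduce_on_plateau(val_losses, lr=0.01, factor=0.75, patience=9):
    """Sketch of the ReduceLROnPlateau rule: multiply the learning rate by
    `factor` once more than `patience` epochs pass without a new best loss."""
    best = float('inf')
    wait = 0
    lr_per_epoch = []
    for loss in val_losses:
        if loss < best:
            best = loss      # improvement: remember it and reset the counter
            wait = 0
        else:
            wait += 1
            if wait > patience:
                lr *= factor  # plateau outlasted our patience: decay the rate
                wait = 0
        lr_per_epoch.append(lr)
    return lr_per_epoch

# Ten improving epochs followed by a long plateau triggers one reduction
losses = [10.0 - e for e in range(10)] + [1.0] * 15
final_lr = reduce_on_plateau(losses)[-1]  # 0.01 * 0.75, i.e. about 0.0075
```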
%%time
all_models['8_1_LR_S'] = keras.models.clone_model(all_models['8_1_Nadam'])
all_models['8_1_LR_S'].compile(**nadam_compile_options(learning_rate=0.01))
deep_8_1_LR_S_checkpoint = keras.callbacks.ModelCheckpoint(
MODEL_CHECKPOINT_FILE.replace('.weights.h5', '_8_1_LR_S.weights.h5'),
**common_checkpoint_options
)
learning_rate_schedule = keras.callbacks.ReduceLROnPlateau(
monitor='val_loss',
factor=0.75,
patience=9,
verbose=1,
mode='min'
)
learning_rate_histories['8_1_LR_S'] = \
all_models['8_1_LR_S'].fit(
**common_fit_options,
callbacks=[deep_8_1_LR_S_checkpoint, learning_rate_schedule])
all_models['8_1_LR_S'].load_weights(MODEL_CHECKPOINT_FILE.replace('.weights.h5', '_8_1_LR_S.weights.h5'))
all_models['8_1_LR_S'].summary()
Epoch 75: ReduceLROnPlateau reducing learning rate to 0.007499999832361937. Epoch 89: ReduceLROnPlateau reducing learning rate to 0.005624999874271452.
Model: "sequential_4"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓ ┃ Layer (type) ┃ Output Shape ┃ Param # ┃ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩ │ normalization (Normalization) │ (None, 10) │ 21 │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ dense_13 (Dense) │ (None, 8) │ 88 │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ dense_14 (Dense) │ (None, 1) │ 9 │ └─────────────────────────────────┴────────────────────────┴───────────────┘
Total params: 315 (1.24 KB)
Trainable params: 97 (388.00 B)
Non-trainable params: 21 (88.00 B)
Optimizer params: 197 (792.00 B)
CPU times: total: 2.08 s Wall time: 7.96 s
Learning Rate Schedule Training Loss Plot¶
Let's look for an improvement in the training loss rate over epochs.
%%time
plot_training_loss(learning_rate_histories['8_1_LR_S'], '8-1 NN Model (LR=Scheduled)')
CPU times: total: 0 ns Wall time: 7.51 ms
Scheduled Learning Rate Score¶
%%time
chosen_arch_preds = {'8_1_LR_S': all_models['8_1_LR_S'].predict(X_test).flatten()}
deep_model_scores_df = score_model(chosen_arch_preds['8_1_LR_S'], np.array(y_test), index='8_1_LR_S')
# Add it to the leaderboard
learning_rate_leaderboard_df = score_combine(learning_rate_leaderboard_df, deep_model_scores_df)
learning_rate_leaderboard_df[:]
24/24 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step CPU times: total: 15.6 ms Wall time: 85.1 ms
| mean_squared_error | mean_absolute_error | explained_variance_score | r2_score | |
|---|---|---|---|---|
| 8_1_LR_0_1 | 3.761772 | 1.446253 | 0.183375 | 0.179870 |
| 8_1_LR_0_01 | 3.712384 | 1.408267 | 0.238712 | 0.237596 |
| 8_1_LR_0_001 | 4.019154 | 1.480873 | 0.136125 | 0.136110 |
| 8_1_LR_0_0001 | 5.629026 | 1.729137 | 0.024360 | 0.004204 |
| 8_1_LR_S | 3.750417 | 1.413749 | 0.159442 | 0.158437 |
The scheduled learning rate's error stats land just behind the fixed 0.01 rate's best-in-class numbers. But not so fast: let's take a look at the big picture.
Learning Rate Decision¶
Let's compare the training loss plots and the leaderboard scores for all the learning rates.
Reminder of our criteria:
- Mean Absolute Error within 2 years.
- Reasonable Explained Variance Score
- Reasonable R2 Score
- Avoid Overfitting
- Reasonable Learning Rate
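For reference, the two goodness-of-fit columns on the leaderboard reduce to a few lines of NumPy. These are plain-NumPy equivalents (function names mine) of the metrics our `score_model` helper reports; the example shows why they can disagree:

```python
import numpy as np

def r2(y_true, y_pred):
    # 1 minus (residual sum of squares / total sum of squares)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

def explained_variance(y_true, y_pred):
    # Like R2, but blind to a constant bias in the residuals
    return 1.0 - np.var(y_true - y_pred) / np.var(y_true)

y_true = np.array([9.0, 10.0, 11.0, 12.0])
biased = y_true + 1.0               # right shape, constant +1 year bias
r2(y_true, biased)                  # ~0.2: the bias costs R2 points
explained_variance(y_true, biased)  # 1.0: all the variance is explained
```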
Learning Rate Training Loss Plots¶
%%time
plot_training_loss_from_dict(learning_rate_histories)
CPU times: total: 15.6 ms Wall time: 37.5 ms
Learning Rate Leaderboard¶
Check the leaderboard again before the big decision.
%%time
learning_rate_leaderboard_df[:]
CPU times: total: 0 ns Wall time: 0 ns
| mean_squared_error | mean_absolute_error | explained_variance_score | r2_score | |
|---|---|---|---|---|
| 8_1_LR_0_1 | 3.761772 | 1.446253 | 0.183375 | 0.179870 |
| 8_1_LR_0_01 | 3.712384 | 1.408267 | 0.238712 | 0.237596 |
| 8_1_LR_0_001 | 4.019154 | 1.480873 | 0.136125 | 0.136110 |
| 8_1_LR_0_0001 | 5.629026 | 1.729137 | 0.024360 | 0.004204 |
| 8_1_LR_S | 3.750417 | 1.413749 | 0.159442 | 0.158437 |
And the Winner Is...¶
Scheduled Learning Rate!¶
Spending a little extra time on the learning rate paid off. The schedule's error scores sit within a hair of the fixed 0.01 rate, its variance is acceptable, and the decaying rate should keep training stable if we add epochs later.
Specifically, we are using ReduceLROnPlateau for our schedule.
- Factor = 0.75: The factor by which the learning rate will be reduced. new_lr = lr * factor.
- Patience = 9: Number of epochs with no improvement after which learning rate will be reduced.
Others exist (like ExponentialDecay), but this one worked for us this time.
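For contrast, ExponentialDecay follows a fixed timetable instead of reacting to plateaus. Its rule (the formula behind `keras.optimizers.schedules.ExponentialDecay`, without the optional `staircase` rounding) is small enough to sketch directly:

```python
def exponential_decay(step, initial_lr=0.01, decay_rate=0.96, decay_steps=1000):
    # lr(step) = initial_lr * decay_rate ** (step / decay_steps)
    return initial_lr * decay_rate ** (step / decay_steps)

exponential_decay(0)     # 0.01, the starting rate
exponential_decay(1000)  # one full decay interval: about 0.0096
```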
So far we have chosen the (8-1) neural network architecture with the Nadam optimizer and a scheduled learning rate.
Loss Function to Mean Absolute Error¶
Let's try a different loss function to see if it improves the model.
- Loss Function: Mean Absolute Error (MAE)
  - Less sensitive to outliers.
  - Penalizes all errors equally.
This could be good for our model, as we removed outliers in the data cleaning step. Let's find out.
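A tiny worked example of that sensitivity difference (made-up ages, not our crab data): one badly-missed prediction barely moves MAE but dominates MSE.

```python
import numpy as np

y_true = np.array([5.0, 6.0, 7.0, 8.0])
y_pred = np.array([5.5, 6.5, 7.5, 12.0])  # last prediction misses by 4 years

errors = y_pred - y_true                   # [0.5, 0.5, 0.5, 4.0]
mae = np.mean(np.abs(errors))              # 1.375: the outlier adds linearly
mse = np.mean(errors ** 2)                 # 4.1875: the outlier contributes 16 of 16.75
```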
We'll keep the best architecture so far and change the loss function to MAE.
%%time
all_models['8_1_MAE'] = keras.models.clone_model(all_models['8_1_LR_S'])
all_models['8_1_MAE'].compile(**nadam_compile_options(
learning_rate=0.01,
loss_metric='mean_absolute_error'))
deep_8_1_MAE_checkpoint = keras.callbacks.ModelCheckpoint(
MODEL_CHECKPOINT_FILE.replace('.weights.h5', '_8_1_MAE.weights.h5'),
**common_checkpoint_options
)
loss_histories = {'8_1_MAE':
all_models['8_1_MAE'].fit(
**common_fit_options,
callbacks=[deep_8_1_MAE_checkpoint, learning_rate_schedule])}
all_models['8_1_MAE'].load_weights(MODEL_CHECKPOINT_FILE.replace('.weights.h5', '_8_1_MAE.weights.h5'))
all_models['8_1_MAE'].summary()
Epoch 57: ReduceLROnPlateau reducing learning rate to 0.007499999832361937. Epoch 69: ReduceLROnPlateau reducing learning rate to 0.005624999874271452. Epoch 86: ReduceLROnPlateau reducing learning rate to 0.004218749818392098. Epoch 95: ReduceLROnPlateau reducing learning rate to 0.003164062276482582.
Model: "sequential_4"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓ ┃ Layer (type) ┃ Output Shape ┃ Param # ┃ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩ │ normalization (Normalization) │ (None, 10) │ 21 │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ dense_13 (Dense) │ (None, 8) │ 88 │ ├─────────────────────────────────┼────────────────────────┼───────────────┤ │ dense_14 (Dense) │ (None, 1) │ 9 │ └─────────────────────────────────┴────────────────────────┴───────────────┘
Total params: 315 (1.24 KB)
Trainable params: 97 (388.00 B)
Non-trainable params: 21 (88.00 B)
Optimizer params: 197 (792.00 B)
CPU times: total: 1.72 s Wall time: 8.06 s
Loss Function = Mean Absolute Error Training Loss Plot¶
**Note**: The loss function is different, so the scale will be different.
%%time
plot_training_loss(loss_histories['8_1_MAE'], '8-1 NN Model (MAE)', y_lim=(0, 5))
CPU times: total: 0 ns Wall time: 8.52 ms
Loss Function = Mean Absolute Error Score¶
We can't tell anything yet since it's a new scale. Let's check out the leaderboard with all the metrics.
%%time
chosen_arch_preds['8_1_MAE'] = all_models['8_1_MAE'].predict(X_test).flatten()
deep_model_scores_df = score_model(chosen_arch_preds['8_1_MAE'], np.array(y_test), index='8_1_MAE')
# Add it to the leaderboard
loss_leaderboard_df = score_combine(learning_rate_leaderboard_df.loc[['8_1_LR_S']], deep_model_scores_df)
loss_leaderboard_df[:]
24/24 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step CPU times: total: 31.2 ms Wall time: 91.2 ms
| mean_squared_error | mean_absolute_error | explained_variance_score | r2_score | |
|---|---|---|---|---|
| 8_1_LR_S | 3.750417 | 1.413749 | 0.159442 | 0.158437 |
| 8_1_MAE | 3.897357 | 1.405759 | 0.208332 | 0.201616 |
Mean Absolute Error Loss Function Observations¶
A mixed result rather than a clear improvement. The MAE-trained model edges ahead on mean absolute error, explained variance, and R2 score, but its mean squared error is worse than the MSE-trained model's, meaning larger individual misses slip through.
Perhaps an Ensemble Will Help¶
But I'm running out of time. Let's move on to feature engineering with our best model so far.
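For the record, the simplest ensemble here would just average per-model predictions. A sketch with made-up numbers (with more time we'd average entries from `chosen_arch_preds` instead):

```python
import numpy as np

# Hypothetical age predictions from two models on the same three crabs
preds_a = np.array([9.5, 11.0, 7.5])
preds_b = np.array([10.5, 11.0, 8.5])

# Equal-weight averaging ensemble
ensemble = (preds_a + preds_b) / 2  # [10.0, 11.0, 8.0]
```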
Winner, Winner, Crab's for Dinner!¶
Reminder of our criteria:
- Mean Absolute Error within 2 years.
- Reasonable Explained Variance Score
- Reasonable R2 Score
- Avoid Overfitting
- Reasonable Learning Rate
Our Best Model So Far¶
- Architecture: (8-1) Neural Network
- Optimizer: Nadam
- Learning Rate: Scheduled
- Start = 0.01
- Factor = 0.75
- Patience = 9 epochs
- Loss Function: Mean Squared Error
This model should be quick to train to an acceptable level.
%%time
# layer: input
output_as_input_df = pd.concat([X_train, y_train], axis=1)
layer_output_as_input_input = keras.layers.Input(shape=(len(output_as_input_df.columns),))
# layer: normalizer
layer_output_as_input_normalizer = keras.layers.Normalization(axis=-1)
layer_output_as_input_normalizer.adapt(np.array(output_as_input_df))
# layer: output (linear regression)
layer_output_as_input_output = keras.layers.Dense(units=1)
# architecture:
# input -> normalizer -> linear
# initialize the all_models dictionary
all_models = {'linear_add': keras.Sequential([
layer_output_as_input_input,
layer_output_as_input_normalizer,
layer_output_as_input_output])}
# compile options
all_models['linear_add'].compile(**common_compile_options())
# checkpoint options
linear_add_checkpoint = keras.callbacks.ModelCheckpoint(
MODEL_CHECKPOINT_FILE.replace('.weights.h5', '_linear_add.weights.h5'),
**common_checkpoint_options)
# fit options
all_histories = {
'linear_add':
all_models['linear_add'].fit(
x=output_as_input_df,
y=y_train,
validation_data=(pd.concat([X_test, y_test], axis=1), y_test),
epochs=NUM_EPOCHS,
callbacks=[linear_add_checkpoint])}
all_models['linear_add'].load_weights(MODEL_CHECKPOINT_FILE.replace('.weights.h5', '_linear_add.weights.h5'))
# score the model
preds = all_models['linear_add'].predict(output_as_input_df).flatten()
scores_df = score_model(preds, np.array(y_train), index='linear_add')
leaderboard_df = score_combine(leaderboard_df, scores_df)
# summary
all_models['linear_add'].summary()
Epoch 1/100 95/95 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - loss: 105.6924 - val_loss: 103.0483
Epoch 2/100 95/95 ━━━━━━━━━━━━━━━━━━━━ 0s 820us/step - loss: 101.6268 - val_loss: 99.4652
[... epochs 3-99 elided: both losses fall steadily throughout ...]
Epoch 100/100 95/95 ━━━━━━━━━━━━━━━━━━━━ 0s 780us/step - loss: 2.9205 - val_loss: 2.7600 95/95 ━━━━━━━━━━━━━━━━━━━━ 0s 544us/step
Model: "sequential_10"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type)                    ┃ Output Shape           ┃       Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ normalization_4 (Normalization) │ (None, 11)             │            23 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_22 (Dense)                │ (None, 1)              │            12 │
└─────────────────────────────────┴────────────────────────┴───────────────┘
Total params: 61 (252.00 B)
Trainable params: 12 (48.00 B)
Non-trainable params: 23 (96.00 B)
Optimizer params: 26 (108.00 B)
CPU times: total: 891 ms Wall time: 9.35 s
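The parameter counts in the summary above can be reproduced by hand. A quick sketch, assuming 11 input features and an Adam-style optimizer that keeps two moment slots per trainable weight plus two scalar counters (consistent with the 26 optimizer params shown):

```python
n_features = 11

# Normalization layer: one mean and one variance per feature, plus a sample count
norm_params = n_features + n_features + 1  # all non-trainable

# Dense(1): one weight per feature plus a bias
dense_params = n_features * 1 + 1  # all trainable

# Adam-style optimizer: two slots (m, v) per trainable weight, plus two scalar counters
optimizer_params = 2 * dense_params + 2

total = norm_params + dense_params + optimizer_params
print(norm_params, dense_params, optimizer_params, total)  # 23 12 26 61
```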
%%time
# show the scores
leaderboard_df[:]
CPU times: total: 0 ns Wall time: 0 ns
| model | mean_squared_error | mean_absolute_error | explained_variance_score | r2_score |
|---|---|---|---|---|
| untrained_linear | 101.943787 | 9.748023 | 0.049686 | -13.000124 |
| linear | 3.997827 | 1.473956 | 0.011232 | 0.010781 |
| 64_32_16_8_1 | 3.630893 | 1.404292 | 0.302562 | 0.302399 |
| 32_16_8_1 | 3.602257 | 1.385743 | 0.338213 | 0.337592 |
| 16_8_1 | 3.807182 | 1.415440 | 0.280600 | 0.279393 |
| 8_1 | 3.794136 | 1.432214 | 0.228980 | 0.228786 |
| 4_1 | 3.901053 | 1.461622 | 0.178054 | 0.177953 |
| 2_1 | 3.946111 | 1.468480 | 0.044709 | 0.044348 |
| linear_add | 2.769430 | 1.654084 | 0.996007 | 0.669256 |
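The leaderboard columns correspond to the scikit-learn metrics of the same names. A minimal sketch of scoring one model's predictions (the arrays here are hypothetical stand-ins for `y_test` and a model's output):

```python
import numpy as np
from sklearn.metrics import (mean_squared_error, mean_absolute_error,
                             explained_variance_score, r2_score)

y_true = np.array([9.0, 11.0, 8.0, 10.0, 12.0])  # hypothetical true ages
y_pred = np.array([9.5, 10.0, 8.5, 10.5, 11.0])  # hypothetical predictions

scores = {
    'mean_squared_error': mean_squared_error(y_true, y_pred),
    'mean_absolute_error': mean_absolute_error(y_true, y_pred),
    'explained_variance_score': explained_variance_score(y_true, y_pred),
    'r2_score': r2_score(y_true, y_pred),
}
print(scores)
```

Note that `explained_variance_score` ignores systematic bias in the predictions while `r2_score` penalizes it, which is why the two can diverge sharply, as in the `linear_add` row above.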
Linear Model with Output as Additional Input Observations¶
Look at that explained variance! It's cheating like a mad crab.
With the target fed in as an input, a single linear layer can simply copy it back out. This is a good example of why we need to be careful about data leakage.
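The effect is easy to reproduce: leak the target in as a feature and even ordinary least squares fits it almost perfectly. A toy sketch with synthetic data (all names and values here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 3))  # three legitimate features
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=5.0, size=n)

# Leak the target in as a fourth "feature"
X_leaky = np.column_stack([X, y])

def fit_r2(features, target):
    # Ordinary least squares with a bias column, scored by R^2
    A = np.column_stack([features, np.ones(len(target))])
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    resid = target - A @ coef
    centered = target - target.mean()
    return 1 - (resid @ resid) / (centered @ centered)

honest_r2 = fit_r2(X, y)
leaky_r2 = fit_r2(X_leaky, y)
print(f'honest R^2: {honest_r2:.3f}')  # limited by the noise
print(f'leaky  R^2: {leaky_r2:.3f}')   # ~1.0: the model just copies the target
```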
Model as Code¶
...and it's not working. Worth a shot.
%%time
import numpy as np

def relu(x):
    # np.maximum, not np.max: element-wise max against zero
    return np.maximum(0, x)

def predict(weights, layers, X):
    # Initialize input
    layer_input = X
    # Iterate over layers
    for i in range(layers):
        # Apply weights to input and add bias
        layer_output = np.dot(layer_input, weights[i][0]) + weights[i][1]
        # No activation to apply: every layer in this model is linear (identity)
        # Update input for next layer
        layer_input = layer_output
    # Return output of final layer as predictions
    return layer_output
# get the weights
weights = [layer.get_weights() for layer in all_models['linear_add'].layers]
print(weights)
# predict
preds = predict(weights, 2, output_as_input_df.iloc[0])
print(preds)
[[array([ 1.3032532 , 1.0129274 , 0.34452584, 23.003716 , 10.014996 ,
5.0239677 , 6.6078315 , 0.3064995 , 0.3253052 , 0.36819533,
9.732101 ], dtype=float32), array([8.8651031e-02, 6.0300808e-02, 8.6254207e-03, 1.8041740e+02,
3.7331261e+01, 9.0801477e+00, 1.4141748e+01, 2.1255755e-01,
2.1948172e-01, 2.3262753e-01, 8.4085999e+00], dtype=float32), 0], [array([[-0.20978996],
[ 0.351122 ],
[-0.02942573],
[-0.5934581 ],
[-0.2053751 ],
[ 0.28914857],
[ 0.5067023 ],
[ 0.04240793],
[ 0.04285356],
[ 0.06395935],
[ 2.7610233 ]], dtype=float32), array([8.078017], dtype=float32)]]
[4990.3853]
CPU times: total: 0 ns
Wall time: 1.01 ms
That Doesn't Seem Right¶
Oh well, we tried. Let's move on to the next step.
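For the record, the likely culprit: `model.layers` includes the Normalization layer, whose weights are the per-feature mean, variance, and a sample count rather than a dense kernel, so matrix-multiplying by them scrambles the input. A hedged sketch of how the forward pass could be done by hand instead (the weight values below are hypothetical stand-ins for the real model's):

```python
import numpy as np

# Hypothetical weights for a Normalization -> Dense(1) model with 3 features
mean = np.array([1.0, 10.0, 5.0])
variance = np.array([0.25, 4.0, 1.0])
kernel = np.array([[0.5], [1.5], [-2.0]])
bias = np.array([8.0])

def predict(x):
    # Normalization layer: standardize, don't matrix-multiply
    # (a small epsilon guards against zero variance)
    z = (x - mean) / np.sqrt(variance + 1e-7)
    # Dense layer: kernel times input plus bias, identity activation
    return z @ kernel + bias

x = np.array([1.5, 12.0, 4.0])
print(predict(x))
```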
Onwards to Feature Engineering¶
See the next section for feature engineering.
<html link> for feature reduction.